MetaPhlAn 4.1 release (new virome database and SGB database update)
Announcement: We are pleased to share that MetaPhlAn 4.1 is now available, which includes a new module for viral profiling as described in this bioRxiv preprint (Discovering and exploring the hidden diversity of human gut viruses using highly enriched virome samples). This is enabled by an updated MetaPhlAn database version (vJun23), which also includes 6,272 more SGBs compared to the previous version (vOct22). A parallel release for HUMAnN 3 (v3.9) is also provided to maintain compatibility between MetaPhlAn 4’s updated taxonomic profiles and HUMAnN 3’s (non-viral) pangenome catalog.
For details on MetaPhlAn 4, see the MetaPhlAn 4 announcement or visit the MetaPhlAn 4 GitHub repository.
MetaPhlAn new viral database in a nutshell
In MetaPhlAn 4.1 we introduce a new option (--profile_vsc) that allows the user to profile a set of >162,000 viral sequences extracted from thousands of metagenomic assemblies. The database was built by leveraging 5,651 contigs from 255 viromes highly enriched for virus-like particles, which were then used to retrieve thousands more similar sequences from viromes and metagenomes. We validated the catalog by removing sequences mapping against binned MAGs (to discard sequences of clear bacterial origin) and keeping only those sequences found multiple times in unbinned metagenomes (more likely to represent viruses).
The sequences were clustered into 3,944 Viral Sequence Clusters (VSCs), then further grouped into 1,345 Viral Sequence Groups (VSGs). Each cluster/group is labeled as known (kVSG) or unknown (uVSG) depending on whether it contains at least one viral RefSeq reference genome. A set of 45,872 sequences representative of the viral groups is now incorporated in the MetaPhlAn database with this 4.1 release.
With this new option, MetaPhlAn 4.1 will report the breadth of coverage of each Viral Sequence Group (VSG) in a separate file, in parallel with the standard analysis and without the need for an additional mapping step.
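For readers unfamiliar with the metric, breadth of coverage is the fraction of positions in a reference sequence covered by at least one mapped read. The snippet below is a minimal sketch of that definition only, not MetaPhlAn's own code; the function name `breadth_of_coverage` and the half-open, 0-based interval convention are assumptions made for illustration.

```python
def breadth_of_coverage(intervals, seq_length):
    """Fraction of positions covered by at least one alignment.

    intervals: iterable of (start, end) alignment spans, 0-based, half-open.
    seq_length: length of the reference sequence in bases.
    """
    if seq_length <= 0:
        raise ValueError("seq_length must be positive")

    covered = 0
    current_start = current_end = None
    for start, end in sorted(intervals):
        # Clip each alignment to the bounds of the reference sequence.
        start, end = max(start, 0), min(end, seq_length)
        if start >= end:
            continue
        if current_end is None or start > current_end:
            # Gap before this alignment: close the previous merged block.
            if current_end is not None:
                covered += current_end - current_start
            current_start, current_end = start, end
        else:
            # Overlapping or adjacent alignment: extend the merged block.
            current_end = max(current_end, end)
    if current_end is not None:
        covered += current_end - current_start
    return covered / seq_length


# Two overlapping alignments plus a distant one on a 1,000 bp sequence.
print(breadth_of_coverage([(0, 100), (50, 150), (400, 500)], 1000))
```

Overlapping reads are merged before counting, so deep coverage of a short region does not inflate the value.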
Additionally, we derived potential virus-host associations by mapping the viral sequences against CRISPR spacers derived from 546,457 microbial genomes and MAGs, further validating the phagic nature of the majority of the sequences in the catalog (>90% of the viral groups were assignable to a microbial host) and providing potential phage-host associations together with the viral profiling module.
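The idea behind spacer-based host assignment can be sketched as follows: if a CRISPR spacer from a microbial genome matches a viral sequence, that microbe is a candidate host. The toy example below uses exact substring matching on both strands; this is an assumption made to keep the sketch short, and the actual analysis presumably relied on a sequence aligner and tolerance thresholds that this announcement does not describe. The names `candidate_hosts` and `spacers_by_host` are hypothetical.

```python
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def candidate_hosts(viral_seq, spacers_by_host):
    """Return the sorted hosts with at least one spacer found in viral_seq.

    Matching is exact and checks both strands (toy simplification).
    spacers_by_host: dict mapping a host name to a list of spacer sequences.
    """
    forward = viral_seq.upper()
    reverse = reverse_complement(forward)
    hosts = set()
    for host, spacers in spacers_by_host.items():
        for spacer in spacers:
            spacer = spacer.upper()
            if spacer and (spacer in forward or spacer in reverse):
                hosts.add(host)
                break  # one matching spacer is enough for this host
    return sorted(hosts)


# Hypothetical sequences, made up for the example.
viral = "ACGTTGCAAGGCTTAACCGGTT"
spacers = {
    "host_A": ["GCAAGGCTTA"],  # matches the forward strand
    "host_B": ["CCGGTTAA"],    # matches the reverse strand
    "host_C": ["GGGGGGGGGG"],  # no match
}
print(candidate_hosts(viral, spacers))
```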
What is new in MetaPhlAn 4.1
- Ability to profile viruses leveraging a custom viral sequence database. For details on the usage of the new MetaPhlAn 4 viral module check the tutorial.
- Possibility of subsampling reads on the fly during the MetaPhlAn run. For a usage example check the tutorial.
- Several StrainPhlAn (available within MetaPhlAn) improvements (now StrainPhlAn 4.1), including faster execution times for small/medium trees, faster --print_clades_only mode, and simplified sample/marker filtering parameters (check the change log for the full list).
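To illustrate what subsampling reads "on the fly" can look like, here is a minimal sketch of reservoir sampling, a standard way to draw a fixed number of items uniformly from a stream without loading it all into memory. This is a generic technique shown for illustration; the announcement does not say how MetaPhlAn implements subsampling internally, and the function name, parameters, and seed value below are hypothetical.

```python
import random


def subsample_reads(reads, n_reads, seed=42):
    """Keep n_reads reads chosen uniformly at random from a stream.

    Uses reservoir sampling (Algorithm R): a single pass over the input,
    holding at most n_reads records in memory. If the stream has fewer
    than n_reads reads, all of them are returned.
    """
    if n_reads < 0:
        raise ValueError("n_reads must be non-negative")
    rng = random.Random(seed)  # fixed seed so a run can be reproduced
    reservoir = []
    for i, read in enumerate(reads):
        if i < n_reads:
            reservoir.append(read)
        else:
            # Replace a kept read with probability n_reads / (i + 1).
            j = rng.randint(0, i)
            if j < n_reads:
                reservoir[j] = read
    return reservoir


# Stand-in for FASTQ records; real input would be parsed from a file.
reads = [f"read_{i}" for i in range(1000)]
sample = subsample_reads(iter(reads), 100)
print(len(sample))
```

Fixing the seed makes the same subset come out on every run, which helps when comparing profiles across repeated analyses.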
What has changed in vJun23 in comparison to vOct22
Expansion of the genomic database
- ~45k reference genomes from NCBI
- ~50k MAGs from ocean
- ~40k MAGs from soil
- ~30k MAGs from domestic animals and non-human primates
- ~4k MAGs from giant turtles
- ~7.5k MAGs from skin microbiome
- ~20k MAGs from dental plaque
- ~15k MAGs from Asian populations
- ~2.7k MAGs from ancient and modern Bolivians
- other small datasets from diverse sources
Expansion of the markers database
- vJun23 now includes 36,822 SGBs
- 6,272 more SGBs than in vOct22
How to make use of the MetaPhlAn 4.1 updates
- How to install MetaPhlAn 4.1 in a new environment:
- How to upgrade the database from the previous vOct22 version:
- $ metaphlan --install --force_download
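Once the installation and database are up to date, a viral profiling run can be assembled along the following lines. This is a hedged sketch that only builds and prints the command; it does not run MetaPhlAn. The `--profile_vsc` option comes from this announcement, while `--vsc_out`, `--input_type`, `--nproc`, and `-o` reflect our understanding of the MetaPhlAn command line and should be checked against `metaphlan --help` for your installed version. The file names and the helper function are hypothetical.

```python
import shlex


def build_viral_profiling_command(fastq, profile_out, vsc_out, nproc=4):
    """Build the argument list for a MetaPhlAn run with viral profiling.

    All flags other than --profile_vsc are assumptions to verify
    against `metaphlan --help`.
    """
    return [
        "metaphlan", fastq,
        "--input_type", "fastq",
        "--nproc", str(nproc),
        "--profile_vsc",       # enable the viral profiling module
        "--vsc_out", vsc_out,  # assumed flag for the separate viral report
        "-o", profile_out,     # standard taxonomic profile
    ]


cmd = build_viral_profiling_command(
    "sample.fastq.gz", "sample_profile.txt", "sample_vsc.txt"
)
print(shlex.join(cmd))
```

Keeping the arguments as a list makes it straightforward to pass the command to `subprocess.run(cmd, check=True)` once the flags have been confirmed.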