MetaPhlAn 4.1 release (new virome database and SGB database update)

Announcement: We are pleased to share that MetaPhlAn 4.1 is now available, which includes a new module for viral profiling as described in this bioRxiv preprint (Discovering and exploring the hidden diversity of human gut viruses using highly enriched virome samples). This is enabled by an updated MetaPhlAn database version (vJun23), which also includes 6,272 more SGBs compared to the previous version (vOct22). A parallel release for HUMAnN 3 (v3.9) is also provided to maintain compatibility between MetaPhlAn 4’s updated taxonomic profiles and HUMAnN 3’s (non-viral) pangenome catalog.

For details on MetaPhlAn 4, see the MetaPhlAn 4 announcement or visit the MetaPhlAn 4 GitHub repository.

MetaPhlAn new viral database in a nutshell

In MetaPhlAn 4.1 we introduce a new option (--profile_vsc) that allows the user to profile a set of >162,000 viral sequences extracted from thousands of metagenomic assemblies. The database was built by leveraging 5,651 contigs from 255 viromes highly enriched for virus-like particles, which were then used to retrieve thousands more similar sequences from viromes and metagenomes. We validated the catalog by removing sequences mapping against binned MAGs (to discard sequences of clear bacterial origin) and kept only those sequences found multiple times in unbinned metagenomes (more likely to represent viruses).

The sequences were clustered into 3,944 VSCs (Viral Sequence Clusters), then further grouped into 1,345 Viral Sequence Groups (VSGs). Each cluster/group is labeled as known (kVSG) or unknown (uVSG) depending on whether it contains at least one viral RefSeq reference genome. A set of 45,872 sequences representative of each viral group is now incorporated in the MetaPhlAn 4.1 database.

With this new option, MetaPhlAn 4.1 reports the breadth of coverage of each Viral Sequence Group (VSG) in a separate file, in parallel with the standard analysis and without the need for an additional mapping.

Additionally, we derived potential virus-host associations by mapping the viral sequences against CRISPR spacers derived from 546,457 microbial genomes and MAGs, further validating the phagic nature of the majority of the sequences in the catalog (>90% of the viral groups were assignable to a microbial host) and providing potential phage-host associations together with the viral profiling module.

What is new in MetaPhlAn 4.1

  • Ability to profile viruses leveraging a custom viral sequence database. For details on the usage of the new MetaPhlAn 4 viral module, check the tutorial.
  • Possibility of subsampling reads on the fly during the MetaPhlAn run. For a usage example, check the tutorial.
  • Several StrainPhlAn (available within MetaPhlAn) improvements (now StrainPhlAn 4.1), including faster execution times for small/medium trees, a faster --print_clades_only mode, and simplified sample/marker filtering parameters (check the change log for the full list).
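
As an illustration of the on-the-fly subsampling mentioned in the list above, here is a minimal dry-run sketch. It assumes the option is exposed as --subsampling with an accompanying --subsampling_seed (please confirm both with metaphlan --help for your installed version); the input file name, read count, seed, and output name are hypothetical placeholders. The snippet only builds and prints the command so it can be inspected before running.

```shell
# Dry-run sketch: build the subsampling command and print it for inspection.
INPUT="metagenome.fastq.gz"   # hypothetical input file
N_READS=1000000               # hypothetical number of reads to keep
SEED=1992                     # hypothetical seed, fixed for reproducibility

# --subsampling / --subsampling_seed are assumed flag names; check metaphlan --help.
CMD="metaphlan $INPUT --input_type fastq --subsampling $N_READS --subsampling_seed $SEED -o profile_subsampled.txt"

echo "$CMD"
# Once the printed command looks right, run it with: eval "$CMD"
```

Fixing the seed means repeated runs should draw the same subset of reads, which helps when comparing samples at equal depth.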

What has changed in vJun23 in comparison to vOct22

Expansion of the genomic database

  • ~45k reference genomes from NCBI
  • ~50k MAGs from ocean
  • ~40k MAGs from soil
  • ~30k MAGs from domestic animals and non-human primates
  • ~4k MAGs from giant turtles
  • ~7.5k MAGs from skin microbiome
  • ~20k MAGs from dental plaque
  • ~15k MAGs from Asian populations
  • ~2.7k MAGs from ancient and modern Bolivians
  • other small datasets from diverse sources

Expansion of the markers database

  • vJun23 now includes 36,822 SGBs
    • 6,272 more SGBs than in vOct22

How to make use of the MetaPhlAn 4.1 updates

Thanks for releasing MetaPhlAn 4.1. It will be a great improvement. However, I am not able to detect any viruses with this version, whereas with MetaPhlAn 3.0 I find 5% viruses in my sample.
I downloaded the mpa_vJun23 database using the metaphlan download option, then tried to find the viruses with these commands:
metaphlan Gut_MG0182_S52_L001_R1_001.fastq.gz --input_type fastq -o profiled_metagenome_3.txt --nproc 12 --bowtie2db /mnt/d/metaphlan_4.1_db/ --add_viruses --mpa3 --profile_vsc --ignore_bacteria --ignore_archaea

metaphlan Gut_MG0182_S52_L001_R1_001.fastq.gz --input_type fastq -o profiled_metagenome_3.txt --nproc 12 --bowtie2db /mnt/d/metaphlan_db --add_viruses --mpa3 --profile_vsc --ignore_bacteria --ignore_archaea

metaphlan Gut_MG0182_S52_L001_R1_001.fastq.gz --input_type fastq -o profiled_metagenome.txt --profile_vsc --nproc 12 --bowtie2db /mnt/d/metaphlan_4.1_db/

I successfully get the Archaea and Bacteria, but I do not get any viruses from the same FASTQ file in which MetaPhlAn 3 reports 5% viral reads.

Can you suggest any options that I might be missing?

Hi @rohitshukla
You should add the parameter --vsc_out to specify the output file of the viral profiling. Please have a look at the tutorial here: MetaPhlAn 4.1 · biobakery/biobakery Wiki · GitHub
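
For reference, a minimal dry-run sketch of the third command from the question with --vsc_out added. The input file and database path are taken from the post above; the viral output file name is a hypothetical placeholder. The sketch leaves out --mpa3 and --add_viruses, on the assumption that those flags target the MetaPhlAn 3 database format rather than vJun23. It only builds and prints the command so it can be checked before running.

```shell
# Dry-run sketch: build the viral profiling command and print it for inspection.
INPUT="Gut_MG0182_S52_L001_R1_001.fastq.gz"   # file name from the question
DB_DIR="/mnt/d/metaphlan_4.1_db/"             # database path from the question
VSC_OUT="profiled_viruses.txt"                # hypothetical name for the viral profile

# --profile_vsc enables the viral module; --vsc_out names its separate output file.
CMD="metaphlan $INPUT --input_type fastq --nproc 12 --bowtie2db $DB_DIR -o profiled_metagenome.txt --profile_vsc --vsc_out $VSC_OUT"

echo "$CMD"
# Once the printed command looks right, run it with: eval "$CMD"
```

With this layout the standard taxonomic profile would go to profiled_metagenome.txt and the VSG breadth-of-coverage report to the file given to --vsc_out.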

Hi there,

Are SGB to GTDB profiles available for the latest database? I installed MetaPhlAn v4.0.6 and it installed the latest database, but I tried running the sgb_to_gtdb script and it didn’t work.


Hi @bluenote-1577
Installing the latest 4.1 version will already include the files needed for compatibility with Jun23, but you can also find the translation files here: MetaPhlAn/metaphlan/utils/mpa_vJun23_CHOCOPhlAnSGB_202307_SGB2GTDB.tsv at master · biobakery/MetaPhlAn · GitHub
They should be included within the utils folder of your MetaPhlAn installation.
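
Once the translation file is in place, the conversion could look like the dry-run sketch below. It assumes the script is installed as sgb_to_gtdb_profile.py and accepts -i (input profile) and -o (output profile); please confirm with sgb_to_gtdb_profile.py -h. Both file names are hypothetical placeholders.

```shell
# Dry-run sketch: build the SGB-to-GTDB conversion command and print it.
PROFILE="profiled_metagenome.txt"        # hypothetical MetaPhlAn SGB profile
GTDB_OUT="profiled_metagenome_gtdb.txt"  # hypothetical GTDB-labelled output

# Script name and -i/-o options are assumptions; check the script's -h output.
CMD="sgb_to_gtdb_profile.py -i $PROFILE -o $GTDB_OUT"

echo "$CMD"
# Once the printed command looks right, run it with: eval "$CMD"
```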

Hi there,

I noticed that the databases mpa_vOct22_CHOCOPhlAnSGB_202403 and mpa_vJun23_CHOCOPhlAnSGB_202403 were uploaded close together, on 2024-04-05 and 2024-03-11. Besides, the mpa_latest file still points to mpa_vJun23_CHOCOPhlAnSGB_202307. So, which version is the latest database? And what is the difference between mpa_vJun23 and mpa_vOct22?


Sorry for the inconvenience, but in what order should I put the command-line parameters to identify viruses? I hope you can help me.

I can’t understand the instructions.

Do the new VSCs only contain phage data? I looked at the paper and at some of the metadata for the VSCs and can’t seem to find things like Norovirus, so I wanted to ask.

If this is correct, is there a reason non-phage viruses were left out?

Hi @Sawyerxu

This question is related to the 4.1.1 version of MetaPhlAn (see Announcing MetaPhlAn 4.1.1 release). The Jun23 database is still the latest. The databases that were uploaded (mpa_vJun23_CHOCOPhlAnSGB_202403 and mpa_vOct22_CHOCOPhlAnSGB_202403) contain a fix to the taxonomic labels of some SGBs in Jun23 and Oct22, respectively. However, the SGBs you will find in the updated databases are still the same. To update to the latest fixed-taxonomy database you can use metaphlan --install --force_download
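
A minimal dry-run sketch of that update, for anyone who keeps the database in a custom folder. It assumes --index and --bowtie2db can be combined with --install to select the database version and location; the folder path is a hypothetical placeholder, and the index name is the one mentioned above.

```shell
# Dry-run sketch: build the database update command and print it for inspection.
INDEX="mpa_vJun23_CHOCOPhlAnSGB_202403"   # fixed-taxonomy Jun23 database named above
DB_DIR="/path/to/metaphlan_db"            # hypothetical database folder

# --install --force_download re-downloads the database even if a copy already exists.
CMD="metaphlan --install --force_download --index $INDEX --bowtie2db $DB_DIR"

echo "$CMD"
# Once the printed command looks right, run it with: eval "$CMD"
```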

Hi @Chris091089

The order of the parameters does not matter. You can see an example of how to use the viral profiling here: MetaPhlAn 4.1 · biobakery/biobakery Wiki · GitHub