MetaPhlAn 4 published + database update

Announcement: We are pleased to share that MetaPhlAn 4 is now published open-access in Nature Biotechnology (Extending and improving metagenomic taxonomic profiling with uncharacterized species using MetaPhlAn 4 | Nature Biotechnology). We are also releasing an updated database version (vOct22) with 3,580 more SGBs compared to the current and published version (vJan21).

MetaPhlAn 4 in a nutshell

The new MetaPhlAn 4 integrates information from both metagenome assemblies and microbial isolate genomes for improved and more comprehensive metagenomic taxonomic profiling (Extending and improving metagenomic taxonomic profiling with uncharacterized species using MetaPhlAn 4 | Nature Biotechnology). From a curated collection of 1.01M prokaryotic reference and metagenome-assembled genomes, we define ~5M unique marker genes for 26,970 species-level genome bins (SGBs), 4,992 of them taxonomically unidentified at the species level (uSGBs). MetaPhlAn 4, when compared to version 3, explains ~20% more reads in most international human gut microbiomes and >40% in less-characterized environments such as the rumen microbiome. MetaPhlAn 4 also proves to be more accurate than available alternatives on synthetic evaluations while also reliably quantifying organisms with no cultured isolates, a major improvement for the field. Application of the method to >24,500 metagenomes highlights previously undetected species that are strong biomarkers for host phenotypes in human and mouse microbiomes and, thanks to StrainPhlAn 4, shows that even previously uncharacterized species can be genetically profiled at the resolution of single microbial strains. MetaPhlAn 4 integrates the novelty of metagenomic assemblies with the sensitivity and fidelity of reference-based analyses, providing efficient metagenomic profiling of uncharacterized species and enabling deeper and more comprehensive microbiome analyses.

For more detailed information about MetaPhlAn 4, please visit Announcing MetaPhlAn 4.

An updated markers database is now available!

Further, we have already updated the currently published MetaPhlAn 4 markers database (named vJan21) with the addition of more than 200k new genomes! The new vOct22 database can now profile 3,580 more SGBs than vJan21 while keeping similar performance on synthetic evaluations.

What has changed in vOct22 in comparison to vJan21:

Genomic database:

  • Inclusion of new MAGs
    • ~10k MAGs from food-related sources
    • ~35k MAGs from the oral cavity
    • ~50k MAGs from the mouse gut
    • ~30k environmental MAGs
    • ~1k archaeal MAGs
    • ~3k MAGs from long-read sequencing
    • additional public MAGs from diverse sources (ocean, animals, skin, etc)
  • 2,548 genomes considered reference genomes in vJan21 were relabelled as MAGs in NCBI
    • 1,550 kSGBs in vJan21 are now uSGBs in vOct22 (515 with less than 5 MAGs in vOct22 and thus discarded)
  • Removed redundant reference genomes from the vJan21 genomic database using a MASH distance threshold of 0.1%
  • Local reclustering to improve SGB definitions of oversized or too-close SGBs
  • Improved GGB and FGB definitions by reclustering SGB centroids from scratch
  • Improved phylum assignment of SGBs with no reference genomes at FGB level using MASH distances on amino acids to find the closest kSGB
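
To illustrate the redundancy-removal step in the list above, here is a minimal sketch of a greedy dereplication at a 0.1% distance threshold. The announcement does not describe how redundant genomes were actually selected, so the greedy strategy, the reading of 0.1% as a distance of 0.001, the function name `dereplicate`, and the toy distances are all assumptions made for illustration.

```python
from typing import Dict, FrozenSet, List

# 0.1% written as a fraction; this reading of the threshold is an assumption.
MASH_THRESHOLD = 0.001


def dereplicate(genomes: List[str],
                distances: Dict[FrozenSet[str], float],
                threshold: float = MASH_THRESHOLD) -> List[str]:
    """Greedily keep genomes farther than `threshold` from every kept genome.

    `genomes` is visited in order, so earlier entries win ties.
    Pairs missing from `distances` are treated as maximally distant (1.0).
    """
    kept: List[str] = []
    for genome in genomes:
        redundant = any(
            distances.get(frozenset((genome, other)), 1.0) <= threshold
            for other in kept
        )
        if not redundant:
            kept.append(genome)
    return kept


# Hypothetical toy distances, not real MASH output.
toy_distances = {
    frozenset(("ref_A", "ref_B")): 0.0004,  # near-identical pair
    frozenset(("ref_A", "ref_C")): 0.0300,
    frozenset(("ref_B", "ref_C")): 0.0310,
}
kept_genomes = dereplicate(["ref_A", "ref_B", "ref_C"], toy_distances)
print(kept_genomes)  # ['ref_A', 'ref_C']
```

In this toy case `ref_B` is dropped because it lies within the threshold of the already kept `ref_A`.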

Markers database

  • The new vOct22 now includes 30,550 SGBs
    • 3,580 more SGBs than in vJan21
  • The new vOct22 has drastically reduced the number of SGB groups, i.e. groups of very close SGBs profiled together as a unit
    • Only 90 SGB groups in comparison to the 237 in vJan21
  • We introduced a new procedure to trim markers and only keep fragments that are unique and long enough
    • When mapped to check for uniqueness, the core genes are first split into “windows” of length 150 nt. Using this mapping, we now discard windows that are not unique and reduce/split markers into sufficiently long (>450 nt) chains of unique windows
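
The window-based trimming described above can be sketched roughly as follows. This is only an illustration under stated assumptions: windows are taken as non-overlapping, the per-window uniqueness flags are supplied as input (in practice they would come from the mapping step), ">450 nt" is read as a strict inequality, and the name `trim_marker` is hypothetical rather than part of MetaPhlAn.

```python
from typing import List, Optional, Tuple

WINDOW_NT = 150     # window length stated in the announcement
MIN_CHAIN_NT = 450  # chains must be longer than this (">450 nt")


def trim_marker(marker_len: int,
                window_is_unique: List[bool]) -> List[Tuple[int, int]]:
    """Return half-open (start, end) intervals for each chain of consecutive
    unique windows whose total length exceeds MIN_CHAIN_NT."""
    chains: List[Tuple[int, int]] = []
    chain_start: Optional[int] = None
    for i, unique in enumerate(window_is_unique):
        start = i * WINDOW_NT
        if unique and chain_start is None:
            chain_start = start  # a new chain of unique windows begins
        elif not unique and chain_start is not None:
            if start - chain_start > MIN_CHAIN_NT:
                chains.append((chain_start, start))
            chain_start = None  # the non-unique window breaks the chain
    # Close a chain that runs to the end of the marker.
    if chain_start is not None and marker_len - chain_start > MIN_CHAIN_NT:
        chains.append((chain_start, marker_len))
    return chains


# Hypothetical 1,800 nt core gene split into 12 windows.
flags = [True, True, True, True, False, True, True, False,
         True, True, True, True]
fragments = trim_marker(1800, flags)
print(fragments)  # [(0, 600), (1200, 1800)]
```

Here the two-window chain in the middle (300 nt) is discarded, and the marker is split into two fragments of 600 nt each.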

What is missing:

  • The GTDB taxonomic assignment for the vOct22 database is not available yet (expected release: end of Feb 2023)
  • The phylogenetic tree of life for the vOct22 database is not available yet (expected release: TBD).

How does the new vOct22 profiling compare to vJan21?

Evaluation of the MetaPhlAn 4 vOct22 database using the CAMI II synthetic metagenomes. We evaluated the latest vOct22 database using the 128 synthetic metagenomes representing host-associated communities from the CAMI II taxonomic profiling challenge. We compared the performance of the vOct22 with the previous vJan21 database as well as the available alternatives assessed in the recently published MetaPhlAn 4 work (Extending and improving metagenomic taxonomic profiling with uncharacterized species using MetaPhlAn 4 | Nature Biotechnology). GIT = Gastrointestinal tract, UT = Urogenital tract.

Evaluation of the MetaPhlAn 4 vOct22 in comparison with the previous vJan21 database using real metagenomes from multiple sources. We evaluated the SGB richness reported by MetaPhlAn 4 using the original vJan21 and the latest vOct22 databases on real metagenomic samples from diverse environments. The number of kSGBs per sample tended to decrease due to 1,550 kSGBs in vJan21 that were relabelled as uSGBs in vOct22. NHP = Non-human primates, W = Westernized, NW = non-Westernized, F. food = Fermented food.

How to perform a fresh install of MetaPhlAn 4

The GTDB taxonomic assignment for vOct22 is already available for download here: MetaPhlAn/mpa_vOct22_CHOCOPhlAnSGB_202212_SGB2GTDB.tsv at master · biobakery/MetaPhlAn · GitHub
It will be included by default within the MetaPhlAn package in the next version (4.0.6)

Hi, and thanks for your great software, including the update to MetaPhlAn 4. It works wonderfully on our Illumina MiSeq data.

I was wondering if nanopore (r10.4, q20+) metagenomic data is supported?

When we try to use MetaPhlAn 4 to get a taxonomic profile of our metagenome, we run out of memory. I have dropped the number of cores to 1 and subsetted my reads down to MiSeq size, ~15 GB. I still max out the memory I have (128 GB). I think this is because bowtie2 isn’t really optimized for the longer reads…not sure.

Hi @danchurch
Unfortunately, the current version of MetaPhlAn 4 does not support (and has never been tested on) long-read metagenomic data.

An advantage of the database containing metagenome-assembled genomes is that the increase in completeness would mean an increase in accuracy of estimated proportions. However, a disadvantage would be that the results would be less comparable to 16S sequencing data, since most metagenome-assembled genomes do not have the 16S gene sequence determined.

Does this database support viruses and eukaryotes? I have samples that previously showed eukaryote and virus content with MetaPhlAn 3 that I do not see with MetaPhlAn 4. Thanks.

Hi @Theo_Allnutt
The new database contains the same eukaryotic markers as MetaPhlAn 3, but it does not yet include viral markers.

Could you please release the phylogenetic tree corresponding to the Oct22 database?

HUMAnN 3.6.1 is not compatible with Oct22. Is there, or will there soon be, a version that accepts this database?

Hi @SilasK
We just finished the reconstruction of the Oct22 phylogeny today. We are currently checking it manually for inconsistencies, but we should be able to release it by the end of the week. I will keep you updated. We are also currently working on making HUMAnN 3.6 compatible with the latest version.

The Oct22 phylogeny is already available here: MetaPhlAn/mpa_vOct22_CHOCOPhlAnSGB_202212.nwk at master · biobakery/MetaPhlAn · GitHub

Thanks for the updated MetaPhlAn database vOct22. I assume the marker-to-species file metaphlan/utils/mpa_vOct22_CHOCOPhlAnSGB_202212_SGB2GTDB.tsv is based on GTDB r207 annotations. Is this correct?

A slightly different question: is there any chance the GCA/GCF IDs of the SGB representatives are stored in an accessible file? I’m asking because I’m trying to map across GTDB versions. The mpa_vOct22_CHOCOPhlAnSGB_202212_SGB2GTDB.tsv file only contains the SGBs linked to taxonomy.

Hi @sararp
Unfortunately, the sequences of the genomes used for building the MetaPhlAn databases are not yet available for public download.

Are they available for the older versions of the MetaPhlAn database? I'm having a hard time tracking down the genomes for the database used with MetaPhlAn v3.1.

Hi, we know that the clade names in the abundance table produced by MetaPhlAn contain unique SGB IDs. Is it possible to fetch the corresponding reference sequences (.fna files) in the SGB database using these IDs? Or is there any other way to find the reference sequences used by MetaPhlAn? I am asking because I want to do further analysis using my own sequences together with the reference sequences.