What is the difference between how the MetaPhlAn2 and MetaPhlAn3 databases are built?

Hi, dear developers!

I ran 30 real samples through both MetaPhlAn2 and MetaPhlAn3. The two versions share only ~50% of their taxa at the genus or species level, and many taxa appear only in the MetaPhlAn2 results. So I compared the MetaPhlAn2 and MetaPhlAn3 databases and found that many reference genomes were removed from the MetaPhlAn3 database.

I also found that most of the taxa that appear in MetaPhlAn2 but not in MetaPhlAn3 are ones whose reference genomes were removed from the MetaPhlAn3 database. So, what is the reason for removing these reference genomes?
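For anyone who wants to reproduce this kind of comparison, here is a minimal sketch in Python. It assumes the usual tab-separated profile layout, with the clade name in the first column and `#` comment lines; the function names and the toy profiles are hypothetical, not taken from my data. Note that taxon names can change between database versions, so a purely name-based comparison may overstate the disagreement.

```python
def taxa_at_level(profile_text, prefix):
    """Collect taxon names at one level (e.g. 's__' for species, 'g__' for genus)."""
    taxa = set()
    for line in profile_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and header/comment lines
        clade = line.split("\t")[0]   # first column holds the full lineage
        last = clade.split("|")[-1]   # deepest rank reported on this line
        if last.startswith(prefix):
            taxa.add(last[len(prefix):])
    return taxa


def compare_taxa(a, b):
    """Return shared/unique taxa and the fraction of the union that is shared."""
    shared, union = a & b, a | b
    return {
        "shared": shared,
        "only_a": a - b,
        "only_b": b - a,
        "shared_fraction": len(shared) / len(union) if union else 0.0,
    }


# Toy profiles, invented for illustration only.
v2_profile = (
    "#SampleID\tMetaphlan2_Analysis\n"
    "k__Bacteria|g__Escherichia\t60.0\n"
    "k__Bacteria|g__Escherichia|s__Escherichia_coli\t60.0\n"
    "k__Eukaryota|g__Candida|s__Candida_albicans\t40.0\n"
)
v3_profile = (
    "#clade_name\tNCBI_tax_id\trelative_abundance\tadditional_species\n"
    "k__Bacteria|g__Escherichia\t2|561\t100.0\t\n"
    "k__Bacteria|g__Escherichia|s__Escherichia_coli\t2|561|562\t100.0\t\n"
)

result = compare_taxa(taxa_at_level(v2_profile, "s__"),
                      taxa_at_level(v3_profile, "s__"))
print(result)
```

Running the same comparison with the prefix `g__` gives the genus-level overlap.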

The result of my analysis is pasted below:

The same thing is happening to me. When I compared my results from the v2 database with those from the v3 database, v2 shows fungal species and they disappear with v3.
I have asked this question before, but I haven’t received a reply yet.

Hi @Yong and @shreya,
Thanks for getting in touch. From version 2 to version 3 there has been a huge update to the species and genomes included in the MetaPhlAn database. Have a look at the Methods section of the bioBakery 3 paper (specifically the section on the ChocoPhlAn database) for a better understanding of the improvements: Integrating taxonomic, functional, and strain-level profiling of diverse microbial communities with bioBakery 3 | eLife

Thanks for replying. But if the v3 database is so much bigger than v2, I wonder why I am missing eukaryotes in my profile.txt file.
v2 shows them.

Updating the database with new species also changes its species-specific marker genes. As we add more species, genes that qualified as marker genes in v2 may turn out to be non-unique in v3. I’m not saying that this is specifically your case, but it is one of the most common scenarios in cases like yours.

I get that. I have been doing R&D on my samples to look for fungal species using a bead beater.
With v2, I was getting eukaryotes in all the samples where I used the bead beater. Since you said that v3 has an expanded database, I thought I might get something more if I started using v3. To my surprise, all the eukaryotes disappeared, and so did the viruses. So I tried v2 vs. v3 on my controls, and the results were very odd: many species were missing with v3, while v2 gave satisfactory results. Therefore I went back to v2.
So my question is: when my target is eukaryotes, should I avoid v3 and just keep using v2?

Both the viral and the eukaryotic species that can be profiled with v2 are present in the v3 database. Viral profiling is not activated by default in v3; to enable it, use the --add_viruses parameter. Regarding the eukaryotes, I wouldn’t go back to v2. Instead, I would adjust the v3 parameters to make it more sensitive, specifically the --stat_q parameter. This parameter is used when MetaPhlAn calculates the robust average coverage of a species (more details in the paper I shared above). In v2 it was set by default to 0.1 (trimming 10% of the marker distribution at both ends), but in v3 we decided to raise it to 0.2 (20%). This means that, to detect a species by default, v2 needs just 10% of the markers to be present in the sample, while v3 needs 20% of them. Adding --stat_q 0.1 to your MetaPhlAn v3 execution might help in your case.
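To make the effect of the trimming concrete, here is a minimal Python sketch of a truncated mean over per-marker coverages. It is a simplified illustration under my own assumptions, not MetaPhlAn's actual implementation: the function name `robust_average` and the toy coverage values are hypothetical.

```python
def robust_average(coverages, stat_q):
    """Truncated mean: drop the lowest and highest stat_q fraction, average the rest.

    Simplified illustration only; the real tool may differ in details.
    """
    values = sorted(coverages)
    k = int(stat_q * len(values))      # number of markers trimmed at each end
    kept = values[k:len(values) - k]   # middle part of the distribution
    if not kept:
        return 0.0
    return sum(kept) / len(kept)


# Toy species with 10 markers, only 2 of which have reads (20% of markers present).
marker_coverages = [0.0] * 8 + [5.0, 7.0]

print(robust_average(marker_coverages, 0.1))  # v2-style default
print(robust_average(marker_coverages, 0.2))  # v3-style default
```

With these toy values, trimming 10% at each end leaves one covered marker in the kept range, so the average is above zero. Trimming 20% removes both covered markers and the average drops to zero, which is how a low-coverage species can appear with 0.1 and vanish with 0.2.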
