Too many SGB or other unclassified species when processing mouse metagenomic samples

Hello, I have recently been using MetaPhlAn 4 to analyze mouse fecal metagenomic samples. However, I have run into some issues and would like to ask for your advice.

Taking one sample as an example: after parsing out the species-level entries and removing those whose names contain GGB or SGB, the total abundance dropped to 70%. I then noticed that many of the remaining species have names that end with “aceae” and/or contain the word “bacterium”, so they do not seem to be classified to a specific species. After removing these as well, the total abundance decreased to 36%. Furthermore, after consulting the literature, I found that many species named “sp_xxx” cannot be interpreted in any detail, so I removed them too, and the total abundance is now 27%.
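For reference, the stepwise filtering described above can be sketched in Python roughly as follows. This is only an illustration under some assumptions: the profile is taken to be the default tab-separated MetaPhlAn output with the clade name in the first column and the relative abundance in the third, the category names and helper functions are made up for this example, and the input rows are invented so that they reproduce the 70% / 36% / 27% figures. Note that "bacterium" is matched as a whole name token, since a plain substring match would also hit genera such as Bifidobacterium.

```python
import re
from collections import defaultdict


def species_name(clade):
    """Return the species label if the row is species-level, else None."""
    last_rank = clade.split("|")[-1]
    if not last_rank.startswith("s__"):
        return None  # higher ranks and t__ (SGB-level) rows are skipped
    return last_rank[3:]


def categorize(name):
    """Assign a species label to one of four illustrative categories."""
    if re.search(r"(GGB|SGB)\d+", name):
        return "unnamed_sgb"
    tokens = name.split("_")
    # Whole-token match: a substring test would also catch Bifidobacterium etc.
    if any(t.endswith("aceae") or t == "bacterium" for t in tokens):
        return "family_placeholder"
    if "sp" in tokens:
        return "unnamed_sp"
    return "named"


def summarize(lines):
    """Sum relative abundance per category over species-level rows."""
    totals = defaultdict(float)
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and header/comment lines
        fields = line.rstrip("\n").split("\t")
        name = species_name(fields[0])
        if name is not None:
            # Assumption: relative abundance sits in the third column.
            totals[categorize(name)] += float(fields[2])
    return dict(totals)


# Invented rows (shortened lineages), not real data.
example = [
    "#clade_name\tNCBI_tax_id\trelative_abundance\tadditional_species",
    "k__Bacteria\t2\t100.0\t",
    "k__Bacteria|g__GGB1234|s__GGB1234_SGB5678\t\t30.0\t",
    "k__Bacteria|g__X|s__Lachnospiraceae_bacterium\t\t34.0\t",
    "k__Bacteria|g__Clostridium|s__Clostridium_sp_AF34_10BH\t\t9.0\t",
    "k__Bacteria|g__Bifidobacterium|s__Bifidobacterium_pseudolongum\t\t27.0\t",
    "k__Bacteria|g__Bifidobacterium|s__Bifidobacterium_pseudolongum|t__SGB17279\t\t27.0\t",
]

totals = summarize(example)
for category, abundance in sorted(totals.items()):
    print(category, abundance)
```

Reporting the per-category totals, rather than silently dropping rows, makes it easy to see how much of the community each filtering step removes.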

Therefore, I would like to ask: 1. Is this way of filtering (mainly to remove unclassified species) reasonable? 2. If so, why does MetaPhlAn 4 identify so few effective species? 3. Is this because my samples are from mice rather than humans? 4. How can I solve this problem?

I look forward to your reply.

Best regards

I’ve been having the same issue with mouse stool samples. I would appreciate any comments on this.


The large proportion of uncharacterized species in your MetaPhlAn 4 profiles simply reflects the fact that you are studying an undercharacterized environment, one with many uncultivated microbes and therefore few properly named species. In fact, MetaPhlAn 4’s ability to identify such uncharacterized species should be considered an asset of the tool rather than a limitation. This capability comes from MetaPhlAn 4 building its marker gene database not only from (isolate) genomes of previously characterized species, but also from MAGs, that is, genomes recovered de novo from metagenomic samples. You can read the details of this approach in the MetaPhlAn 4 paper.

To answer your questions:

  1. I would advise against filtering out uncharacterized species, especially for an undercharacterized environment, where this portion of the community is expected to be large. If you decide to remove such features from your data anyway, be aware that you are essentially removing species that are truly present in the community, so you should evaluate whether this could harm your ability to answer your research question before proceeding with this type of filtering.
  2. I disagree that MetaPhlAn 4 identifies few “effective” species. Identifying the “effective” species is exactly what MetaPhlAn 4 does (through the SGB system, which you can read more about in the paper mentioned above), even when they don’t have a full taxonomic label attached to them. The fact that only a few of the “effective” species identified (i.e., of all SGBs) are fully named is a general limitation of the field, which is lagging behind in the characterization and naming of new microbes.
  3. Yes. The more undercharacterized an environment is from a microbiological point of view, the more uncharacterized species you will have in a metagenomic sample from that environment. Conveniently, these species can now be profiled with MetaPhlAn 4. You can read more about how MetaPhlAn 4 allows a more complete profiling of mouse gut microbiomes in this paper.
  4. I understand this can be a little frustrating, but I don’t necessarily see it as a problem. Basically, what you’re finding is that your samples are full of microbial unknowns. In my opinion, that is more a fact, a scientific result if you will, than a problem.
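As a rough illustration of working with SGBs as features instead of filtering on names, the sketch below keys a profile by SGB identifier and keeps the species label only as an annotation. It assumes the profile includes `t__` rows (for example `...|s__Some_species|t__SGB1234`) with the relative abundance in the third tab-separated column; the function name and the input rows are invented for this example.

```python
def sgb_table(lines):
    """Map SGB id -> (species label, relative abundance) using t__ rows."""
    table = {}
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and header/comment lines
        fields = line.rstrip("\n").split("\t")
        ranks = fields[0].split("|")
        if not ranks[-1].startswith("t__"):
            continue  # only SGB-level rows are kept
        sgb_id = ranks[-1][3:]
        # The species label may be a placeholder such as GGB1234_SGB5678.
        species = next((r[3:] for r in ranks if r.startswith("s__")), "unknown")
        # Assumption: relative abundance sits in the third column.
        table[sgb_id] = (species, float(fields[2]))
    return table


# Invented rows (shortened lineages), not real data.
rows = [
    "#clade_name\tNCBI_tax_id\trelative_abundance\tadditional_species",
    "k__Bacteria|s__GGB1234_SGB5678\t\t30.0\t",
    "k__Bacteria|s__GGB1234_SGB5678|t__SGB5678\t\t30.0\t",
    "k__Bacteria|s__Bifidobacterium_pseudolongum|t__SGB17279\t\t27.0\t",
]

table = sgb_table(rows)
for sgb_id, (species, abundance) in sorted(table.items()):
    print(sgb_id, species, abundance)
```

With this layout, unnamed SGBs stay in the feature table for downstream statistics, and the name is consulted only when interpreting results.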

Hopefully this helps.


Thanks for this post. Very helpful.

My question now is: why do I get so many more SGBs (“unannotated”) when using the ChocoPhlAn database (sample source: human) than when the annotation is converted to GTDB? Is the ChocoPhlAn database being (maybe too) conservative in its annotations? Or is GTDB the better-curated database? What exactly is the reason for this discrepancy?

And maybe a bolder question: since a lot of us using MetaPhlAn are looking for taxonomic annotations (and not SGB numbers), wouldn’t it be better if the default output were GTDB-annotated? Or, to ask the question somewhat differently, why should one use the SGB output over the GTDB output?

Thanks in advance!