Some questions about MetaPhlAn4 profiling results and phylogenetic tree

In the MetaPhlAn4 species name file, I have noticed that some SGB IDs have multiple corresponding species, such as SGB10068, which is associated with dozens of species. However, in my profiling results, the annotation for SGB10068 is “k__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|o__Enterobacterales|f__Enterobacteriaceae|g__Escherichia|s__Escherichia_coli|t__SGB10068”. This raises a few questions:

  1. When annotating an SGB, does MetaPhlAn4 select the most likely correct species while disregarding other species? If so, what criteria does it use for selecting the annotation?
  2. When an SGB corresponds to multiple species, does the species annotation vary among different samples?

In addition, I have noticed that some entries from the species name file, such as EUK5661, are missing from the phylogenetic tree distributed with MetaPhlAn4 (mpa_vOct22_CHOCOPhlAnSGB_202212.nwk, on the master branch of the biobakery/MetaPhlAn GitHub repository). Conversely, there are also leaf nodes in the tree, like 45766:0.0056838754, that do not exist in the species name file. Does this indicate an error?

Hi @Qejyu
Answering your questions:

  1. For SGBs containing reference genomes with multiple taxonomic names, MetaPhlAn chooses the taxonomy by majority rule, i.e. the taxonomy assigned to the largest number of genomes. The Methods section of the MetaPhlAn 4 manuscript gives more detail: "Extending and improving metagenomic taxonomic profiling with uncharacterized species using MetaPhlAn 4" (Nature Biotechnology).
  2. No, the annotation of an SGB is the same regardless of the sample.
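
To illustrate the majority rule described in point 1, here is a minimal Python sketch. It is not MetaPhlAn's actual code: the function name `majority_taxonomy`, the input format (one species label per reference genome) and the tie-breaking behaviour are all assumptions made for illustration, and the genome labels in the usage example are invented.

```python
from collections import Counter


def majority_taxonomy(genome_labels):
    """Return the most frequent label and the fraction of genomes carrying it.

    genome_labels: one taxonomic label per reference genome in the SGB
    (hypothetical input format, not MetaPhlAn's internal representation).
    Ties go to the label seen first, which is an assumption of this sketch.
    """
    if not genome_labels:
        raise ValueError("an SGB needs at least one reference genome")
    counts = Counter(genome_labels)
    label, n = counts.most_common(1)[0]
    return label, n / len(genome_labels)


# Invented example: three genomes labelled E. coli, one labelled otherwise.
labels = ["s__Escherichia_coli"] * 3 + ["s__Shigella_flexneri"]
print(majority_taxonomy(labels))
```

Because the label depends only on the reference genomes in the database, and not on the reads of any sample, the same SGB always receives the same annotation, which is the point made in answer 2.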

Unfortunately, the MetaPhlAn phylogenetic tree does not include microeukaryotic species, because we built the tree using 200 universal bacterial and archaeal conserved genes. For the additional nodes, could you share the full list of species that are in the tree but not in the database?
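
One way such a list could be produced is sketched below in Python. It assumes that leaf labels in the .nwk file are SGB IDs without the "SGB" prefix (as the leaf 45766 suggests), that labels are unquoted, and that the database IDs are available as a plain list of strings. The function names and the toy tree (including the branch length 0.1) are made up for illustration.

```python
import re


def leaf_names(newick):
    """Extract leaf labels from a Newick string.

    A leaf label directly follows '(' or ','; internal node labels follow ')'
    and are therefore skipped. Quoted labels are not handled.
    """
    return set(re.findall(r"[(,]\s*([^():,;\s]+)", newick))


def compare_tree_to_db(newick, db_ids):
    """Return (leaves missing from the db, db IDs missing from the tree)."""
    leaves = leaf_names(newick)
    # Strip the "SGB" prefix so that SGB10068 matches the leaf 10068.
    stripped = {i[3:] if i.startswith("SGB") else i for i in db_ids}
    return sorted(leaves - stripped), sorted(stripped - leaves)


# Toy tree and ID list, not the real files.
toy_tree = "(10068:0.1,45766:0.0056838754);"
toy_ids = ["SGB10068", "EUK5661"]
only_in_tree, only_in_db = compare_tree_to_db(toy_tree, toy_ids)
print(only_in_tree)  # ['45766']
print(only_in_db)    # ['EUK5661']
```

For the real files, the tree string would be read from the .nwk file and the IDs from the species name file; a dedicated Newick parser would be more robust than the regular expression used here.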

Thank you for your response.
To annotate the nodes of the phylogenetic tree with species, I removed the “SGB” prefix from the database IDs and mapped each species annotation to the tree node with the matching number. However, 140 nodes were left without an annotation. I have listed them in the attached file.
node.txt (956 Bytes)