Missing bacterial species in Metaphlan4

Dear Metaphlan4 developers,

Thanks for the fantastic tool.

I was looking into the SGB sequences for some specific species (focusing on potentially pathogenic ones), but found that many of them don’t exist in the final SGB database. However, I can find many of those species in supplementary Table S1. I assume those are medium-quality SGBs that were excluded from the final high-quality SGB database (n=26,970). Some of them are clinically important, though, such as E. albertii and Bacillus anthracis. I also noticed that some species, like Shigella sonnei and Shigella dysenteriae, were classified into the SGB (SGB10068) for which E. coli was selected as the representative.

My questions are: 1) Is my observation correct, or have I misinterpreted something? 2) How can we still get the medium-quality SGBs (are they available for download?), or is there some way to incorporate additional marker sequences for additional species? I ask because I found high abundances of E. albertii with Metaphlan2 in some samples, but they are completely missed by Metaphlan4. 3) For species that have multiple SGBs (F. prausnitzii and Ruminococcus flavefaciens), should we view the different SGBs as different (not-yet-named) species? Can we consider the SGB with the highest number of reconstructed genomes as the dominant SGB for the species?

Hi @Cheng_Guo
Answering your questions:

  1. Partially. For cases like B. anthracis, which cannot be differentiated from the other species in the B. cereus group using core genes, MetaPhlAn won’t be able to distinguish between the individual species of the group. A similar scenario occurs with S. sonnei and S. dysenteriae, which are genetically close to E. coli (>95% ANI), so they cluster together in the same SGB.
  2. As the low-quality SGBs were discarded before or during the marker-gene generation steps, it is not possible to retrieve their data for profiling.
  3. For species that span multiple SGBs (F. prausnitzii and F. nucleatum, for example), you should consider each SGB as a different (sub-)species.
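
To make point 3 concrete, here is a minimal Python sketch that groups SGB-level rows of a profile by their species label, so each SGB can be inspected as its own (sub-)species. It assumes that SGB-level rows have a `|`-separated lineage ending in a `t__` field with an `s__` field before it; check this against your own output files. The function name `group_sgbs_by_species`, the SGB labels (`SGB_A`, `SGB_B`, `SGB_C`), and the abundance values are all invented for illustration and are not real results.

```python
from collections import defaultdict

# Hypothetical (clade_name, relative_abundance) rows. The lineage layout is an
# assumption, and the SGB labels and abundances are made up for illustration.
EXAMPLE_ROWS = [
    ("k__Bacteria|s__Faecalibacterium_prausnitzii", 8.5),
    ("k__Bacteria|s__Faecalibacterium_prausnitzii|t__SGB_A", 6.0),
    ("k__Bacteria|s__Faecalibacterium_prausnitzii|t__SGB_B", 2.5),
    ("k__Bacteria|s__Escherichia_coli|t__SGB_C", 1.0),
]


def group_sgbs_by_species(rows):
    """Return {species: {sgb: abundance}} built from SGB-level rows only."""
    per_species = defaultdict(dict)
    for clade_name, abundance in rows:
        levels = clade_name.split("|")
        # Skip rows that do not end at the SGB ("t__") level.
        if not levels[-1].startswith("t__"):
            continue
        sgb = levels[-1][len("t__"):]
        # Take the closest species-level label in the lineage, if any.
        species = next(
            (lvl[len("s__"):] for lvl in reversed(levels) if lvl.startswith("s__")),
            "unclassified",
        )
        per_species[species][sgb] = abundance
    return dict(per_species)


if __name__ == "__main__":
    for species, sgbs in group_sgbs_by_species(EXAMPLE_ROWS).items():
        print(species, sgbs)
```

Keeping the SGBs separate, rather than summing them into one species value, matches the advice above to treat each SGB as a distinct (sub-)species.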