Dear MetaPhlAn 4 developers,
Thanks for the fantastic tool.
I was looking into the SGB sequences for some specific species (focusing on potentially pathogenic ones), but found that many of them are absent from the final SGB database. However, I can find many of those species in supplementary Table S1. I assume those are medium-quality SGBs that were excluded from the final high-quality SGB database (n=26,970). However, some of them are clinically important, such as E. albertii and Bacillus anthracis. I also noticed that some species, such as Shigella sonnei and Shigella dysenteriae, were assigned to the SGB (SGB10068) for which E. coli was selected as the representative.
My questions are:
1) Is my observation correct, or have I misinterpreted something?
2) How can we still get the medium-quality SGBs (are they available for download?), or is there some way to incorporate additional marker sequences for additional species? I ask because I found high abundances of E. albertii with MetaPhlAn 2 in some samples, but the species is completely missing from the MetaPhlAn 4 results.
3) For species that have multiple SGBs (e.g. F. prausnitzii and Ruminococcus flavefaciens), should we treat each SGB as a distinct (not-yet-named) species? Can we consider the SGB with the highest number of reconstructed genomes as the dominant SGB for that species?
Thank you in advance.
Best,
Cheng