I hope other MetaPhlAn users or developers can comment on how they interpret community profiling results expressed in the species-level genome bin (SGB) taxonomic system used by MetaPhlAn4.
I am asking because I ran into the following situation while using MetaPhlAn4 (v4.0.3 with the mpa_vJan21_CHOCOPhlAnSGB_202103 database) to profile the taxa in a set of metagenomes I am working with. MetaPhlAn4 indicates that SGB9458 is the dominant SGB in these metagenomes. Supplementary Table 1 from the MetaPhlAn4 paper indicates that SGB9458 contains 3 reference genomes and 15 reconstructed genomes. Because 2 of the 3 reference genomes are identified as Neisseria flavescens on NCBI, SGB9458 was assigned the species-level taxonomy “s__Neisseria_flavescens.” However, one of the two N. flavescens reference genomes is flagged as failing the taxonomy check on NCBI, and the third reference genome was sequenced from the N. subflava type strain. It therefore seems plausible that one or both N. flavescens reference genomes are misidentified N. subflava genomes. If that were the case, MetaPhlAn4’s taxonomic assignment scheme would dictate that SGB9458 be assigned the taxonomy “s__Neisseria_subflava.”
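To make the reasoning above concrete, here is a minimal Python sketch of a plain majority vote over the species labels of an SGB's reference genomes. This is my simplified reading of the assignment rule, not MetaPhlAn4's actual code: the function name `majority_species` and the tie handling (returning `None`) are assumptions for illustration, and the second label list is a hypothetical relabeling, not something NCBI currently reports.

```python
from collections import Counter


def majority_species(reference_labels):
    """Return the most common species label among reference genomes.

    Assumption: a plain majority vote. Ties (and empty input) return None,
    because the real tie-breaking behaviour is not described in this post.
    """
    ranked = Counter(reference_labels).most_common()
    if not ranked:
        return None
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return None
    return ranked[0][0]


# Labels as described above for the three reference genomes in SGB9458.
as_labeled = [
    "Neisseria flavescens",
    "Neisseria flavescens",  # the genome flagged as failing the taxonomy check
    "Neisseria subflava",    # sequenced from the N. subflava type strain
]

# Hypothetical: the flagged genome is treated as N. subflava instead.
if_relabeled = [
    "Neisseria flavescens",
    "Neisseria subflava",
    "Neisseria subflava",
]

print(majority_species(as_labeled))
print(majority_species(if_relabeled))
```

With only three reference genomes, relabeling a single one is enough to flip the vote, which is why this SGB's species name seems fragile to me.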
This case led me to wonder whether the best practice when interpreting MetaPhlAn4 results is to (A) manually check each SGB in your results to determine whether the automatically assigned taxonomies seem reasonable or (B) use MetaPhlAn4’s taxonomic designations when you present your results but ensure your audience knows that MetaPhlAn4 uses a different taxonomic system than either NCBI or GTDB.
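For option (A), a small script could at least produce the list of SGBs worth checking by hand. The sketch below assumes the profile is a tab-separated table in which comment lines start with `#`, the first column is a `|`-delimited lineage ending in a `t__SGB...` entry for SGB-level rows, and the third column is the relative abundance. The function name `list_sgbs`, the 1% cutoff, and the example rows (including the abundance values and the `SGB_EXAMPLE` entry) are made up for illustration and are not taken from my data.

```python
def list_sgbs(profile_text, min_abundance=1.0):
    """Return (sgb_id, species_label, abundance) for SGB-level rows.

    Assumed layout: tab-separated columns, lineage in column 1,
    relative abundance (percent) in column 3, '#' marks comment lines.
    """
    rows = []
    for line in profile_text.splitlines():
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        if len(fields) < 3:
            continue
        ranks = fields[0].split("|")
        # Keep only rows whose deepest rank is an SGB (t__ prefix).
        if not ranks[-1].startswith("t__"):
            continue
        abundance = float(fields[2])
        if abundance < min_abundance:
            continue
        species = next((r[3:] for r in ranks if r.startswith("s__")), "unknown")
        rows.append((ranks[-1][3:], species, abundance))
    # Most abundant SGBs first, since those matter most for interpretation.
    return sorted(rows, key=lambda row: row[2], reverse=True)


# Illustrative input only: lineages are shortened and the numbers are invented.
example_profile = "\n".join([
    "#clade_name\tNCBI_tax_id\trelative_abundance\tadditional_species",
    "k__Bacteria|g__Neisseria|s__Neisseria_flavescens\tNA\t60.0\t",
    "k__Bacteria|g__Neisseria|s__Neisseria_flavescens|t__SGB9458\tNA\t60.0\t",
    "k__Bacteria|g__Example|s__Example_species|t__SGB_EXAMPLE\tNA\t0.5\t",
])

found = list_sgbs(example_profile)
for sgb_id, species, abundance in found:
    print(sgb_id, species, abundance)
```

Each SGB ID in the output could then be looked up in Supplementary Table 1 to see how many reference genomes back the assigned species name.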
-Anthony