I hope other MetaPhlAn users or developers can comment on how they interpret community profiling results expressed in the species-level genome bin (SGB) taxonomic system used by MetaPhlAn4.
I am curious because I encountered the following situation while using MetaPhlAn4 (v4.0.3 with the mpa_vJan21_CHOCOPhlAnSGB_202103 database) to profile the taxa in a set of metagenomes I am working with. MetaPhlAn4 indicates SGB9458 is the dominant SGB in these metagenomes. Supplementary Table 1 from the MetaPhlAn4 paper indicates SGB9458 contains 3 reference genomes and 15 reconstructed genomes. Because 2 of the 3 reference genomes are identified as Neisseria flavescens on NCBI, SGB9458 was assigned the species-level taxonomy “s__Neisseria_flavescens.” However, one of the two N. flavescens reference genomes is flagged as failing the taxonomy check on NCBI, and the third reference genome was sequenced from the N. subflava type strain. It therefore seems plausible that one or both N. flavescens reference genomes are misidentified N. subflava genomes. If that were the case, MetaPhlAn4’s taxonomic assignment scheme would dictate that SGB9458 be assigned the taxonomy “s__Neisseria_subflava.”
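To make the reasoning above concrete, here is a minimal sketch of a simple majority rule over the reference-genome labels. It is my own simplification and not MetaPhlAn4's actual implementation; the function name `majority_species` and the tie handling are assumptions, and the second list is the hypothetical scenario in which one N. flavescens genome is relabelled.

```python
from collections import Counter


def majority_species(labels):
    """Return the most common label, or None if there is no label or a tie.

    Simplified majority rule for illustration only (an assumption, not the
    MetaPhlAn4 code).
    """
    counts = Counter(labels).most_common()
    if not counts:
        return None
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return None
    return counts[0][0]


# Labels of the three SGB9458 reference genomes as described in the post.
as_labelled = [
    "Neisseria_flavescens",
    "Neisseria_flavescens",
    "Neisseria_subflava",
]

# Hypothetical: the genome that fails the NCBI taxonomy check is relabelled.
if_one_relabelled = [
    "Neisseria_flavescens",
    "Neisseria_subflava",
    "Neisseria_subflava",
]

print("s__" + majority_species(as_labelled))
print("s__" + majority_species(if_one_relabelled))
```

With only three reference genomes, a single relabelled genome is enough to flip the assigned species name, which is why this SGB seemed worth checking by hand.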
This case led me to wonder whether the best practice when interpreting MetaPhlAn4 results is to (A) manually check each SGB in your results to determine whether the automatically assigned taxonomies seem reasonable or (B) use MetaPhlAn4’s taxonomic designations when you present your results but ensure your audience knows that MetaPhlAn4 uses a different taxonomic system than either NCBI or GTDB.
We transform the output of MetaPhlAn4 (SGBs) to the corresponding GTDB taxonomy. It’s my understanding that the Huttenhower group ran all their SGBs through GTDB-Tk to obtain the GTDB taxonomy. In my opinion, GTDB is a gold-standard taxonomy, but of course, there is debate.
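A rough sketch of that kind of relabelling, in case it helps. If I remember correctly, MetaPhlAn 4 ships its own utility for this (sgb_to_gtdb_profile.py); the code below is not that script, just an illustration of the idea. It assumes a tab-separated profile whose first column is the clade name ending in `t__SGB...` for SGB-level rows and whose third column is the relative abundance. The toy profile, the `GTDB_LINEAGE_A` placeholder, and the function names are all made up.

```python
def sgb_rows(profile_text):
    """Yield (sgb_id, clade_name, relative_abundance) for SGB-level rows."""
    for line in profile_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and header/comment lines
        fields = line.split("\t")
        clade = fields[0]
        last_level = clade.split("|")[-1]
        if last_level.startswith("t__"):
            yield last_level[3:], clade, float(fields[2])


def relabel_with_gtdb(profile_text, sgb_to_gtdb):
    """Sum abundances per GTDB label; unmapped SGBs keep their clade name."""
    totals = {}
    for sgb_id, clade, abundance in sgb_rows(profile_text):
        label = sgb_to_gtdb.get(sgb_id, clade)
        totals[label] = totals.get(label, 0.0) + abundance
    return totals


# Toy profile with shortened, made-up lineages and abundances.
toy_profile = "\n".join([
    "#clade_name\tNCBI_tax_id\trelative_abundance\tadditional_species",
    "k__Bacteria|s__Species_one|t__SGB9458\tNA\t60.0\t",
    "k__Bacteria|s__Species_two|t__SGB0002\tNA\t25.0\t",
    "k__Bacteria|s__GGB0003_SGB0003|t__SGB0003\tNA\t15.0\t",
])

# Hypothetical mapping; real GTDB lineages would come from a mapping table.
toy_mapping = {"SGB9458": "GTDB_LINEAGE_A", "SGB0002": "GTDB_LINEAGE_A"}

gtdb_profile = relabel_with_gtdb(toy_profile, toy_mapping)
for label, abundance in gtdb_profile.items():
    print(label, abundance)
```

Note that two SGBs can map to the same GTDB lineage, so abundances are summed per label rather than overwritten.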
I have recently encountered a problem while analyzing my metagenomic data. I have noticed that the taxonomic classification at the species level contains a large number of SGBs. In particular, many of these SGBs have names in the format “ggbxxxx_sgbxxxx” and carry no genus information. I am not sure how to handle these SGBs in downstream analysis.
I would appreciate some guidance on how to proceed. Should I first identify all of the entries that are labelled only with SGB identifiers and convert them to GTDB taxonomy? Can I then merge them with the portion of my data that already has specific species names for downstream analysis?
Actually, I am facing the same problem. Some SGBs with no genus- or species-level information have high relative abundance, which makes it hard to produce a clear taxonomic profile.
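One way to see how much of a profile is affected is to separate species-level rows with placeholder names from those with ordinary species names and total the placeholder abundance. This is only a sketch: it assumes placeholder species look like `s__GGB<number>_SGB<number>` (matched case-insensitively) and that the first and third tab-separated columns are the clade name and relative abundance. The toy profile and the function name are made up.

```python
import re

# Assumed pattern for placeholder species names such as s__GGB1234_SGB5678.
PLACEHOLDER = re.compile(r"^s__GGB\d+_SGB\d+", re.IGNORECASE)


def split_named_unnamed(profile_text):
    """Return (named, unnamed) dicts of species label -> relative abundance."""
    named, unnamed = {}, {}
    for line in profile_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and header/comment lines
        fields = line.split("\t")
        species = fields[0].split("|")[-1]
        if not species.startswith("s__"):
            continue  # keep species-level rows only
        target = unnamed if PLACEHOLDER.match(species) else named
        target[species] = float(fields[2])
    return named, unnamed


# Toy profile with shortened, made-up lineages and abundances.
toy_profile = "\n".join([
    "#clade_name\tNCBI_tax_id\trelative_abundance\tadditional_species",
    "k__Bacteria|g__Neisseria|s__Neisseria_flavescens\tNA\t70.0\t",
    "k__Bacteria|g__GGB1234|s__GGB1234_SGB5678\tNA\t30.0\t",
])

named, unnamed = split_named_unnamed(toy_profile)
print("named species:", sorted(named))
print("placeholder SGBs:", sorted(unnamed))
print("abundance in placeholder SGBs:", sum(unnamed.values()))
```

The placeholder rows can then be kept under their SGB identifiers or relabelled with another taxonomy before being merged back with the named species.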