I have a question regarding phylophlan_metagenomic and its algorithm for assigning newly assembled genomes to known reference genomes or previously published MAGs. Is there a way to identify newly assembled genome bins that have conflicting assignments to previously published bins, i.e. bins whose closest matches have very similar Mash distances but are taxonomically very far apart?
As an example, I recently assembled a medium-quality genome bin with 84% completeness and 2.7% contamination, as estimated by CheckM. CheckM’s lineage assignment places the genome bin in the genus Prevotella. When I run phylophlan_metagenomic to identify its genetically closest published genome, it is assigned to Prochlorococcus SGB48363. Because the distance is < 5% and I selected --add-fgb, both the GGB and the FGB were assigned to the genus and family of the Prochlorococcus SGB, and no further results were reported. However, there are 7 SGBs in PhyloPhlAn’s database that have a distance of < 5% to this bin, ranging from 4.1% to 4.7%. Of these 7 SGBs, two were Prochlorococcus strains (4.1%, 4.3%), two were Prevotella strains (1626: 4.2%, 1632: 4.6%), two were unknown SGBs assigned to the phylum Bacteroidetes (4.4% and 4.5%), and one was a Microbacterium strain (4.5%). In summary, the assembled bin was about equally close to 4 SGBs from the phylum Bacteroidetes, 2 from the phylum Cyanobacteria, and 1 from the phylum Actinobacteria.
Is there a programmatic way already implemented in phylophlan_metagenomic to return the lowest common taxonomic rank of all close matches and to flag genome bins whose candidate assignments are scattered across the tree of life?
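To make the request concrete, here is a minimal sketch of the kind of check I have in mind. It is not part of PhyloPhlAn: the function names (lowest_common_lineage, flag_bin), the "|"-separated lineage format, and the required_rank threshold are my own assumptions. The lineages are abbreviated to the kingdom, phylum, and genus mentioned above, and the distances are the ones from my example.

```python
# Hypothetical helper, not a PhyloPhlAn function: rank prefixes from
# kingdom (k) down to species (s), in the order used for comparison.
RANKS = ["k", "p", "c", "o", "f", "g", "s"]


def lowest_common_lineage(lineages):
    """Return the longest shared leading part of '|'-separated lineages."""
    common = []
    for levels in zip(*(lin.split("|") for lin in lineages)):
        if len(set(levels)) != 1:
            break
        common.append(levels[0])
    return "|".join(common)


def flag_bin(hits, max_dist=0.05, required_rank="f"):
    """Summarise all hits closer than max_dist.

    hits: list of (lineage, distance) tuples (assumed input format).
    Returns (shared lineage, True if the hits only agree above required_rank).
    """
    close = [lineage for lineage, dist in hits if dist < max_dist]
    if not close:
        return None, False
    lca = lowest_common_lineage(close)
    deepest = lca.split("|")[-1][0] if lca else None
    conflicting = deepest is None or RANKS.index(deepest) < RANKS.index(required_rank)
    return lca, conflicting


# The 7 SGBs from the example above (intermediate ranks omitted).
hits = [
    ("k__Bacteria|p__Cyanobacteria|g__Prochlorococcus", 0.041),
    ("k__Bacteria|p__Cyanobacteria|g__Prochlorococcus", 0.043),
    ("k__Bacteria|p__Bacteroidetes|g__Prevotella", 0.042),
    ("k__Bacteria|p__Bacteroidetes|g__Prevotella", 0.046),
    ("k__Bacteria|p__Bacteroidetes", 0.044),
    ("k__Bacteria|p__Bacteroidetes", 0.045),
    ("k__Bacteria|p__Actinobacteria|g__Microbacterium", 0.045),
]

print(flag_bin(hits))  # ('k__Bacteria', True)
```

For this bin the hits only agree at the kingdom level, so it would be flagged rather than silently assigned to the single closest SGB.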