The bioBakery help forum

Quantifying the level of uncertainty in phylophlan_metagenomic


I have a question regarding phylophlan_metagenomic and its algorithm to assign newly assembled genomes to known reference genomes or previously published MAGs. I was wondering whether there is a way that allows the user to identify newly assembled genome bins that have conflicting assignments to previously published bins, i.e. multiple genomes that have a very similar MASH distance but are, however, taxonomically very far apart.

As an example, I recently assembled a medium-quality genome bin with 84% completeness and 2.7% contamination, as estimated by CheckM. CheckM’s lineage assignment places the genome bin in the genus Prevotella. When running phylophlan_metagenomic to identify its genetically closest published genome, it is assigned to Prochlorococcus SGB48363. Since in this case the distance is < 5% and I selected --add-ggb and --add-fgb, both GGB and FGB were assigned to the genus and family of the Prochlorococcus SGB and no further results were reported. However, there are 7 SGBs in PhyloPhlAn’s database that have a distance of < 5% to this bin, ranging from 4.1-4.7%. Of these 7 SGBs, two were Prochlorococcus strains (4.1%, 4.3%), two were Prevotella strains (1626: 4.2%, 1632: 4.6%), two were unknown SGBs assigned to the phylum Bacteroidetes (4.4% and 4.5%), and one was a Microbacterium strain (4.5%). In conclusion, the assembled genome bin was closely related to 4 taxa from the phylum Bacteroidetes, 2 from the phylum Cyanobacteria, and 1 from the phylum Actinobacteria.

Is there a programmatic way already implemented in phylophlan_metagenomic to return the lowest common taxonomic rank and flag such genome bins, whose assignments can be found all over the tree of life?


Hello Alex, and thanks for your question.

The TL;DR is yes and no, meaning that phylophlan_metagenomic will report the closest SGB, but won’t flag the conflicting assignments.

The long version:
When you run phylophlan_metagenomic with the --add-ggb and --add-fgb params and an SGB is assigned to the input genome(s), the GGB and FGB reported will be the ones of the assigned SGB; they will not be calculated separately. This is to avoid inconsistencies in cases like yours, where a genome is <5% MASH distance from more than one SGB and, in some cases, could be closer to a GGB (and/or FGB) of another SGB(s). In this case, the phylophlan_metagenomic approach is to report the closest.

If you run phylophlan_metagenomic without --add-ggb and --add-fgb, you’ll get a list of the closest SGBs (10 by default) sorted by their MASH distance. I believe this is what you did to be able to report the several other SGBs you found to be <5% MASH distance from your input genome.

The inconsistencies described above can happen for many different reasons, which can be summarized as: it all depends on the content of the SGBs in the current release. I don’t know which SGB release you used, but there is more than one SGB release that you can use in phylophlan_metagenomic, and each newer release is incremental w.r.t. the previous one. Incremental means that a more recent SGB release will potentially define new SGB(s) and contain more genomes and/or MAGs in the previously defined SGB clusters. In some cases this can help narrow the assignment, as the addition of more genomes and/or MAGs can change the average MASH distance of your input genome(s) against those in the SGBs (such inconsistencies are in many cases due to the presence of only a single genome in the SGB).

My suggestions are:

  • if you didn’t use the latest SGB release available, please try to re-run phylophlan_metagenomic using the latest release
  • whether or not you used the latest release, you can retrieve from the SGB database the genomes and MAGs present in all the SGBs you found to be <5% (you can find an example of how to run a query here), add your genome to the same folder, and then use PhyloPhlAn to build a phylogeny to further investigate the phylogenetic relationships between these SGBs and your genome.
    Note: you can extend this by including all genomes for all SGBs in the GGBs of the closest (<5% MASH distance) SGBs you found, or even go to the FGB level. Of course, be careful, because you can end up building a phylogeny with thousands of genomes. In that case, if you really want to do it, you can exploit the CheckM scores for them, as in the SGB release of the Pasoli et al (2019) paper (SGB.Jan19), to select only the highest-quality ones.

Sorry for the very long message, I hope this can help you.

Many thanks,

Hi Francesco,

Thanks for the detailed reply. What you described above makes a lot of sense.

I ran the analysis with the latest release from July 2020. After running it with the options --add-ggb and --add-fgb, I went back to the source code and stepped through it in order to see the assignment of the next closest neighbours. There I then realised that I could have achieved the same by just running phylophlan_metagenomic without these options. :wink:

The validation that you describe, using PhyloPhlAn to download all the closest species and build a tree for each candidate with inconsistencies like the one I described, is indeed the most thorough approach, but I would likely only use it to follow up on genome bins that seem to be of striking interest.

I thought that for a quick and dirty flagging of genome bins with large inconsistencies it might be simpler to just use the list of closest genome bins/reference genomes, extract their IDs from the database of phylophlan_metagenomic and run a lowest common ancestor algorithm on the taxonomic rank. I will see whether I am able to implement something like this without too much hassle.
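A rough sketch of what I have in mind, in Python. Everything here is an assumption on my side rather than anything phylophlan_metagenomic provides: the helper names (`lowest_common_ancestor`, `flag_conflict`) are made up, the hits are assumed to be already parsed into (ID, MASH distance, lineage) tuples, and the lineages are assumed to be pipe-separated strings with `k__`, `p__`, ... prefixes. The example lineages are shortened to kingdom, phylum and genus for illustration, and I assumed that SGB48363 is the hit at 4.1%; IDs other than SGB48363, 1626 and 1632 are placeholders.

```python
RANK_PREFIXES = "kpcofgs"  # kingdom, phylum, class, order, family, genus, species


def lowest_common_ancestor(lineages):
    """Return the taxonomic levels shared by all pipe-separated lineages."""
    split = [lineage.split("|") for lineage in lineages]
    shared = []
    for levels in zip(*split):  # stops at the shortest lineage
        if len(set(levels)) != 1:
            break
        shared.append(levels[0])
    return shared


def flag_conflict(hits, max_dist=0.05, min_rank="f"):
    """Return (LCA lineage, flagged) for the hits closer than max_dist.

    A bin is flagged when the LCA of its close hits is shallower than
    min_rank (a one-letter rank prefix, family by default).
    """
    close = [lineage for _id, dist, lineage in hits if dist < max_dist]
    if not close:
        return "", False  # nothing within the threshold, nothing to flag
    shared = lowest_common_ancestor(close)
    if not shared:
        return "", True  # the close hits do not even share a kingdom
    depth = RANK_PREFIXES.index(shared[-1][0])
    return "|".join(shared), depth < RANK_PREFIXES.index(min_rank)


# Hypothetical input mirroring the seven hits described in my first post.
hits = [
    ("SGB48363", 0.041, "k__Bacteria|p__Cyanobacteria|g__Prochlorococcus"),
    ("1626", 0.042, "k__Bacteria|p__Bacteroidetes|g__Prevotella"),
    ("placeholder_a", 0.043, "k__Bacteria|p__Cyanobacteria|g__Prochlorococcus"),
    ("placeholder_b", 0.044, "k__Bacteria|p__Bacteroidetes"),
    ("placeholder_c", 0.045, "k__Bacteria|p__Bacteroidetes"),
    ("placeholder_d", 0.045, "k__Bacteria|p__Actinobacteria|g__Microbacterium"),
    ("1632", 0.046, "k__Bacteria|p__Bacteroidetes|g__Prevotella"),
]

print(flag_conflict(hits))
```

For the bin above, the lowest common ancestor of the seven hits is the kingdom Bacteria, so the bin would be flagged. The threshold rank is a judgment call: family might be too strict for poorly characterised clades, so I left it as a parameter.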

Thanks again for the great reply!
