Explanation of the output of phylophlan_metagenomic

Hello!

I am a bit confused about the output of phylophlan_metagenomic. For example, I have run it with a novel MAG and it gives me:

uSGB_48157:Family:k__Bacteria|p__Actinobacteria|c__Actinobacteria|o__Geodermatophilales|f__Geodermatophilaceae|g__GGB33989|s__GGB33989_SGB48157|t__SGB48157:0.0247779
uSGB_48156:Other:k__Bacteria|p__Bacteroidetes|c__CFGB14357|o__OFGB14357|f__FGB14357|g__GGB39511|s__GGB39511_SGB48156|t__SGB48156:0.276697
uSGB_48155:Genus:k__Bacteria|p__Proteobacteria|c__Alphaproteobacteria|o__Rhodobacterales|f__Rhodobacteraceae|g__Stappia|s__Stappia_SGB48155|t__SGB48155:0.295981

So the first one, the taxon level is “Family”, the second one “Other”, the third one “Genus” - what does this mean?

I guess the 0.0247779 means the MAG is about 2.5% MASH distance from uSGB_48157, are you telling me that this is a Family level assignment because that SGB only has a proper name down to Family level?

What does “Other” mean for the second result? This has a name at Phylum level.

Is there somewhere that explains the output of phylophlan_metagenomic?

Thanks
Mick

Hello Mick, thank you for your message.

We do have a part of the wiki trying to explain the output here: Home · biobakery/phylophlan Wiki · GitHub

But let me try to explain it again here as this could also be found by others in the future.

Your output file should resemble the following:

my_bin	(k|u)SGB_ID:taxa_level:taxonomy:average_mash_distance	[(k|u)SGB_ID:taxa_level:taxonomy:average_mash_distance]

And you’ll find as many

(k|u)SGB_ID:taxa_level:taxonomy:average_mash_distance

columns as specified by the -n/--how_many param (10 is the default).

These columns are sorted by their average_mash_distance, so the first one will be the closest.

Now, for the other parts:

  • my_bin will be the name of your input genome/MAG
  • (k|u)SGB_ID: will tell you the SGB ID (ID) and k or u indicate whether it is a known (k) or an unknown (u) SGB. This follows the rule we used in this work, where kSGB are those that contain a reference genome deposited in public databases (filtered a bit, not just all genomes in NCBI) and uSGB will only contain MAGs.
  • taxa_level: can be Species, Genus, Family, or Other, depending on the taxonomic level at which the SGB has been assigned. In practice, Species is only used for kSGBs, because only those have a taxonomic label assigned at the species level within the SGB. Genus, Family, and Other are for uSGBs and indicate (to put it very simply) how “far” away a reference genome is found. Genus means that the GGB the SGB belongs to contains a reference genome, so its taxonomic label is used up to the genus level. Family works the same way, but the reasoning is done at the FGB level, up to the family taxonomic level. Other means that the GGB and the FGB assigned to that SGB are both unknown (a uGGB and a uFGB, respectively). In this case, we report it as Other and the taxonomic label is taken from the closest reference genome.
  • taxonomy: is the full taxonomic label assigned to the SGB
  • average_mash_distance: is the average Mash distance of the input bin w.r.t. all the genomes in the SGB. In your case, that is about 2.5% average Mash distance from all the MAGs in uSGB_48157.
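
If it helps to see the layout in code, here is a minimal Python sketch that splits one output line into the parts described above. It is not part of PhyloPhlAn: the names SgbHit, parse_hit and parse_line are made up for illustration, "my_bin" is a placeholder bin name, and I am assuming the columns are tab-separated and that the taxonomy string contains no ":" characters. If your file starts with header or comment lines, skip those before parsing.

```python
from typing import List, NamedTuple, Tuple


class SgbHit(NamedTuple):
    """One (k|u)SGB_ID:taxa_level:taxonomy:average_mash_distance column."""
    sgb_id: str            # e.g. "uSGB_48157"
    known: bool            # True for kSGB, False for uSGB
    taxa_level: str        # Species, Genus, Family, or Other
    taxonomy: str          # full taxonomic label of the SGB
    avg_mash_distance: float


def parse_hit(field: str) -> SgbHit:
    # Split the ID and level from the left and the distance from the right,
    # so the taxonomy is whatever remains in the middle.
    sgb_id, taxa_level, rest = field.split(":", 2)
    taxonomy, distance = rest.rsplit(":", 1)
    return SgbHit(sgb_id, sgb_id.startswith("k"), taxa_level, taxonomy, float(distance))


def parse_line(line: str) -> Tuple[str, List[SgbHit]]:
    # Assumption: the first tab-separated column is the bin name,
    # and every following non-empty column is one SGB hit.
    bin_name, *fields = line.rstrip("\n").split("\t")
    return bin_name, [parse_hit(f) for f in fields if f]


# Two of the hits from the question above, joined with a placeholder bin name.
example_line = "\t".join([
    "my_bin",
    "uSGB_48157:Family:k__Bacteria|p__Actinobacteria|c__Actinobacteria"
    "|o__Geodermatophilales|f__Geodermatophilaceae|g__GGB33989"
    "|s__GGB33989_SGB48157|t__SGB48157:0.0247779",
    "uSGB_48155:Genus:k__Bacteria|p__Proteobacteria|c__Alphaproteobacteria"
    "|o__Rhodobacterales|f__Rhodobacteraceae|g__Stappia"
    "|s__Stappia_SGB48155|t__SGB48155:0.295981",
])

bin_name, hits = parse_line(example_line)
for hit in hits:
    print(bin_name, hit.sgb_id, hit.taxa_level, hit.avg_mash_distance)
```

Since the columns are already sorted by distance, hits[0] is the closest SGB.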

In your case, both uSGB_48156 and uSGB_48155 are too distant for your MAG to be considered a potential new member of either, because their average Mash distance is >5%. Your MAG does, however, appear to be a new member of uSGB_48157, with an average Mash distance of ~2.5%.
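
The same reasoning as a small Python sketch, using the three distances from your output. The function name and the constant are made up for illustration, and treating exactly 5% as still inside the SGB is my assumption here, not something the tool defines.

```python
from typing import Dict, Optional

# Assumption: a bin is considered a candidate member of an SGB when its
# average Mash distance to that SGB is at most 5%.
SAME_SGB_MAX_DISTANCE = 0.05


def closest_sgb_within_threshold(
    distances: Dict[str, float],
    threshold: float = SAME_SGB_MAX_DISTANCE,
) -> Optional[str]:
    """Return the closest SGB ID if it is within the threshold, else None."""
    if not distances:
        return None
    sgb_id, distance = min(distances.items(), key=lambda item: item[1])
    return sgb_id if distance <= threshold else None


# Average Mash distances reported for the MAG in the question.
reported = {
    "uSGB_48157": 0.0247779,
    "uSGB_48156": 0.276697,
    "uSGB_48155": 0.295981,
}

print(closest_sgb_within_threshold(reported))
```

If no SGB is within the threshold, the function returns None, which would correspond to a bin that does not fit any of the listed SGBs.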

Sorry for the long message, but I hope this helps.

Many thanks,
Francesco

Hi Francesco

Thank you so much for your explanation, and I must apologise for missing the explanation on the wiki, it’s very bad of me to ask a question that has already been answered!

Many thanks for your patience

Mick
