Inquiry regarding MetaPhlAn SGBs phylogenetic tree

Hi,

I’m planning to do diversity analysis, e.g. UniFrac Distance calculation, using a merged MetaPhlAn output profile and the MetaPhlAn SGB phylogenetic tree (MetaPhlAn/mpa_vJan21_CHOCOPhlAnSGB_202103.nwk at master · biobakery/MetaPhlAn · GitHub).

To this end, first, I opened that phylogenetic tree in R using the ape::read.tree() function to prune tips not in my metaphlan profile taxonomy.

And then I found that the tip labels of the tree are written in numeric values, not taxonomic strings such as 'k__Bacteria|p__Fusobacteria|c__CFGB15|o__OFGB15|f__FGB15|g__GGB18|s__GGB18_SGB19|t__SGB19'.

Codes:

tree <- ape::read.tree("mpa_vJan21_CHOCOPhlAnSGB_202103.nwk")
tree$tip.label

Returns:

> tree$tip.label
   [1] "87903" "77821" "2406"  "2405"  "77822" "77819" "27887" "77828" "27888" "2407"  "2408"  "63416" "36273" "35491" "21384" "79383"
  [17] "21385" "27889" "79382" "2403"  "2404"  "21383" "35481" "27885" "79378" "21386" "79379" "49843" "21541" "21540" "27901" "2394" 
  [33] "27903" "27902" "2398"  "27905" "21545" "35427" "27906" "2400"  "27904" "21418" "27899" "77871" "2389"  "77872" "27891" "49844"
  [49] "27890" "84915" "27894" "27892" "27893" "21543" "27896" "27895" "84913" "84914" "27898" "2412"  "2410"  "2411"  "21544" "21417"
  [65] "35906" "40230" "35908" "35910" "40231" "35909" "40229" "35916" "40247" "35915" "35920" "35912" "40237" "35917" "35921" "35918"
...

What do those numbers mean?

Is it originated from the names of SGBs? (It seems like numbers are similar with names of SGBs in mpa_vJan21_CHOCOPhlAnSGB_202103.pkl file, but I can’t guarantee it.).

Best regards,

Hi @1112
U would suggest you to use the MetaPhlAn/calculate_diversity.R at master · biobakery/MetaPhlAn · GitHub script for such task. For beta diversity the options are bray-curtis, jaccard, weighted-unifrac, unweighted-unifrac, clr, aitchison.

Hi @1112,

It seems that the numerical labels in the tree correspond to the NCBI taxID’s that you can find in MetaPhlAn’s output file:

For the taxonomy that doesn’t have an NCBI taxID I suppose that the tree name corresponds to the number after the SGB___ in the clade_name field.
These are the conclusions I reached after reading the calculate_diversity.R code and doing some comprobations on R myself, but I could be wrong.

Best regards,

gsergom

The numerical labels in the tree correspond, in all cases, to the SGB id

@aitor.blancomiguez @gsergom

Thanks very much for your help!

The explanations solved all my questions.

And I’ll check the calcaulte_diversity.R script you suggested, too.

Best regards,

Hi @aitor.blancomiguez,
I have a very similar question as @1112 above. I understand that the tip labels of the latest MetaPhlAn phylogenetic tree (MetaPhlAn/mpa_vJan21_CHOCOPhlAnSGB_202103.nwk at master · biobakery/MetaPhlAn · GitHub) correspond to the SGB id. I would really like to merge the tree with my phyloseq object in R (created from the MetaPhlAn table).

So my question is, does any file exist somewhere which links this SGB id (from the tree) with the species names exactly as they are written in the MetaPhlan table output? I looked at this file (MetaPhlAn/mpa_vJan21_CHOCOPhlAnSGB_202103_SGB2GTDB.tsv at master · biobakery/MetaPhlAn · GitHub) but the naming convention seems different from my MetaPhlan table.

Thank you,
Brandi

Hi @Brandilyn_Peters
Yes, you can have a look at this file: http://cmprod1.cibio.unitn.it/biobakery4/metaphlan_databases/mpa_vJan21_CHOCOPhlAnSGB_202103_species.txt.bz2

@aitor.blancomiguez - thank you!!! This works well. There are some species that are in this file that are not in the tree (and vice versa). Is that normal?

Thank you,
Brandi

Hi @Brandilyn_Peters
The eukaryotic species in metaphlan are not present in the tree. Moreover, for a few number of SGBs in which we had only MQ genomes were discarded by the quality control of the phylogeny. However, all SGBs in the tree should be present in the file, could you send me the ids of those cases?

Hi @aitor.blancomiguez ,
Thanks for your response. I just checked it and realized that all the SGBs in the tree are actually in the file (I needed to remove the “_group” string from some of the SGB IDs in the file to have them all match with the tree). So there is no problem. Thanks again!

Brandi