Inquiry regarding MetaPhlAn SGBs phylogenetic tree

Hi,

I’m planning to do diversity analysis, e.g. UniFrac Distance calculation, using a merged MetaPhlAn output profile and the MetaPhlAn SGB phylogenetic tree (MetaPhlAn/mpa_vJan21_CHOCOPhlAnSGB_202103.nwk at master · biobakery/MetaPhlAn · GitHub).

To this end, first, I opened that phylogenetic tree in R using the ape::read.tree() function to prune tips not in my metaphlan profile taxonomy.

And then I found that the tip labels of the tree are written in numeric values, not taxonomic strings such as 'k__Bacteria|p__Fusobacteria|c__CFGB15|o__OFGB15|f__FGB15|g__GGB18|s__GGB18_SGB19|t__SGB19'.

Codes:

tree <- ape::read.tree("mpa_vJan21_CHOCOPhlAnSGB_202103.nwk")
tree$tip.label

Returns:

> tree$tip.label
   [1] "87903" "77821" "2406"  "2405"  "77822" "77819" "27887" "77828" "27888" "2407"  "2408"  "63416" "36273" "35491" "21384" "79383"
  [17] "21385" "27889" "79382" "2403"  "2404"  "21383" "35481" "27885" "79378" "21386" "79379" "49843" "21541" "21540" "27901" "2394" 
  [33] "27903" "27902" "2398"  "27905" "21545" "35427" "27906" "2400"  "27904" "21418" "27899" "77871" "2389"  "77872" "27891" "49844"
  [49] "27890" "84915" "27894" "27892" "27893" "21543" "27896" "27895" "84913" "84914" "27898" "2412"  "2410"  "2411"  "21544" "21417"
  [65] "35906" "40230" "35908" "35910" "40231" "35909" "40229" "35916" "40247" "35915" "35920" "35912" "40237" "35917" "35921" "35918"
...

What do those numbers mean?

Is it originated from the names of SGBs? (It seems like numbers are similar with names of SGBs in mpa_vJan21_CHOCOPhlAnSGB_202103.pkl file, but I can’t guarantee it.).

Best regards,

Hi @1112
U would suggest you to use the MetaPhlAn/calculate_diversity.R at master · biobakery/MetaPhlAn · GitHub script for such task. For beta diversity the options are bray-curtis, jaccard, weighted-unifrac, unweighted-unifrac, clr, aitchison.

1 Like

Hi @1112,

It seems that the numerical labels in the tree correspond to the NCBI taxID’s that you can find in MetaPhlAn’s output file:

For the taxonomy that doesn’t have an NCBI taxID I suppose that the tree name corresponds to the number after the SGB___ in the clade_name field.
These are the conclusions I reached after reading the calculate_diversity.R code and doing some comprobations on R myself, but I could be wrong.

Best regards,

gsergom

The numerical labels in the tree correspond, in all cases, to the SGB id

1 Like

@aitor.blancomiguez @gsergom

Thanks very much for your help!

The explanations solved all my questions.

And I’ll check the calcaulte_diversity.R script you suggested, too.

Best regards,

Hi @aitor.blancomiguez,
I have a very similar question as @1112 above. I understand that the tip labels of the latest MetaPhlAn phylogenetic tree (MetaPhlAn/mpa_vJan21_CHOCOPhlAnSGB_202103.nwk at master · biobakery/MetaPhlAn · GitHub) correspond to the SGB id. I would really like to merge the tree with my phyloseq object in R (created from the MetaPhlAn table).

So my question is, does any file exist somewhere which links this SGB id (from the tree) with the species names exactly as they are written in the MetaPhlan table output? I looked at this file (MetaPhlAn/mpa_vJan21_CHOCOPhlAnSGB_202103_SGB2GTDB.tsv at master · biobakery/MetaPhlAn · GitHub) but the naming convention seems different from my MetaPhlan table.

Thank you,
Brandi

Hi @Brandilyn_Peters
Yes, you can have a look at this file: http://cmprod1.cibio.unitn.it/biobakery4/metaphlan_databases/mpa_vJan21_CHOCOPhlAnSGB_202103_species.txt.bz2

1 Like

@aitor.blancomiguez - thank you!!! This works well. There are some species that are in this file that are not in the tree (and vice versa). Is that normal?

Thank you,
Brandi

Hi @Brandilyn_Peters
The eukaryotic species in metaphlan are not present in the tree. Moreover, for a few number of SGBs in which we had only MQ genomes were discarded by the quality control of the phylogeny. However, all SGBs in the tree should be present in the file, could you send me the ids of those cases?

Hi @aitor.blancomiguez ,
Thanks for your response. I just checked it and realized that all the SGBs in the tree are actually in the file (I needed to remove the “_group” string from some of the SGB IDs in the file to have them all match with the tree). So there is no problem. Thanks again!

Brandi

Greetings everyone,

I am reaching out to seek your guidance regarding the import of the Metaphlan 4 tree into a phyloseq object. Unfortunately, I have been facing some difficulties in this process, and I could not decipher how to map it with the supplementary file mentioned.

If someone could generously share their expertise and provide me with the code or strategy used to import the Metaphlan 4 tree into a phyloseq object, it would be greatly appreciated.

Thank you kindly in advance for your help and support.

Best regards,
Vinícius.

1 Like

Hi @Brandilyn_Peters,
Thank you for your inquiries. They helped answer some of the questions I had.

How did you go about relabeling the tips of the MetaPhlAn 4 phylogenetic tree (MetaPhlAn/mpa_vJan21_CHOCOPhlAnSGB_202103.nwk at master · biobakery/MetaPhlAn · GitHub )? I am also trying to merge it with my phyloseq object in R. Did you have to manually find and replace the SGB id with the corresponding species name? I’ve also noticed that some of the SGB id’s have multiple corresponding species, which are separated by a comma in http://cmprod1.cibio.unitn.it/biobakery4/metaphlan_databases/mpa_vJan21_CHOCOPhlAnSGB_202103_species.txt.bz2
I would really appreciate some guidance.

Thank you,
j.ryou

Hi @j.ryou,

This is the code I used to format the tree and merge it with my phyloseq object. Hope it works for you.

tree=read.tree(“H:/mpa_vJan21_CHOCOPhlAnSGB_202103.nwk”)
tree_meta=read.delim(“H:/mpa_vJan21_CHOCOPhlAnSGB_202103_species.txt”,header=F)
tree_meta$V1=gsub(“SGB|_group”,“”,tree_meta$V1) ###added replacement for “_group”
tree_meta=separate(tree_meta,V2,into=c(“V2”,“Other1”,“Other2”,“Other3”,“Other4”,“Other5”,“Other6”),
sep=“,”)
tree_meta=tree_meta[!duplicated(tree_meta[c(‘V2’)]), ]

tree_meta2=subset(tree_meta,V1 %in% tree$tip.label)
tree2=drop.tip(tree,setdiff(tree$tip.label,tree_meta2$V1))
tree_meta2=tree_meta2[match(tree2$tip.label,tree_meta2$V1),]
identical(tree2$tip.label,tree_meta2$V1)
tree2$tip.label=tree_meta2$V2
phy_tree(phy_obj)=tree2

1 Like

Thank you so much, @Brandilyn_Peters!

Hi @j.ryou

How did you handle multiple species matching to the same SGB ID?

I’ve followed everything until this… also really nice to share your code @Brandilyn_Peters!

Thanks!

Hi @em_joyce,

Sorry for the late response. What are you using as your input files?

I have my feature table, with the taxa name (clade_name). And I’m trying to merge it with the SGB-taxa linkage provided by bioBakery (mpa_vJan21_CHOCOPhlAnSGB_202103_species.txt) to have SGBs in my feature table. This way I wouldn’t need to change the tip labels in the tree. But because the .txt file links SGBs to multiple taxa (separated by commas), it isn’t so easy to merge. This relates to your question, because in the code from the response to your original post, it only takes the first taxa name in the list to replace tip labels in the tree. Did that result in lost information? Or was it ok?