Inquiry regarding MetaPhlAn SGBs phylogenetic tree

Hi,

I’m planning to do diversity analysis, e.g. UniFrac Distance calculation, using a merged MetaPhlAn output profile and the MetaPhlAn SGB phylogenetic tree (MetaPhlAn/mpa_vJan21_CHOCOPhlAnSGB_202103.nwk at master · biobakery/MetaPhlAn · GitHub).

To this end, first, I opened that phylogenetic tree in R using the ape::read.tree() function to prune tips not in my metaphlan profile taxonomy.

And then I found that the tip labels of the tree are written in numeric values, not taxonomic strings such as 'k__Bacteria|p__Fusobacteria|c__CFGB15|o__OFGB15|f__FGB15|g__GGB18|s__GGB18_SGB19|t__SGB19'.

Codes:

tree <- ape::read.tree("mpa_vJan21_CHOCOPhlAnSGB_202103.nwk")
tree$tip.label

Returns:

> tree$tip.label
   [1] "87903" "77821" "2406"  "2405"  "77822" "77819" "27887" "77828" "27888" "2407"  "2408"  "63416" "36273" "35491" "21384" "79383"
  [17] "21385" "27889" "79382" "2403"  "2404"  "21383" "35481" "27885" "79378" "21386" "79379" "49843" "21541" "21540" "27901" "2394" 
  [33] "27903" "27902" "2398"  "27905" "21545" "35427" "27906" "2400"  "27904" "21418" "27899" "77871" "2389"  "77872" "27891" "49844"
  [49] "27890" "84915" "27894" "27892" "27893" "21543" "27896" "27895" "84913" "84914" "27898" "2412"  "2410"  "2411"  "21544" "21417"
  [65] "35906" "40230" "35908" "35910" "40231" "35909" "40229" "35916" "40247" "35915" "35920" "35912" "40237" "35917" "35921" "35918"
...

What do those numbers mean?

Is it originated from the names of SGBs? (It seems like numbers are similar with names of SGBs in mpa_vJan21_CHOCOPhlAnSGB_202103.pkl file, but I can’t guarantee it.).

Best regards,

Hi @1112
U would suggest you to use the MetaPhlAn/calculate_diversity.R at master · biobakery/MetaPhlAn · GitHub script for such task. For beta diversity the options are bray-curtis, jaccard, weighted-unifrac, unweighted-unifrac, clr, aitchison.

1 Like

Hi @1112,

It seems that the numerical labels in the tree correspond to the NCBI taxID’s that you can find in MetaPhlAn’s output file:

For the taxonomy that doesn’t have an NCBI taxID I suppose that the tree name corresponds to the number after the SGB___ in the clade_name field.
These are the conclusions I reached after reading the calculate_diversity.R code and doing some comprobations on R myself, but I could be wrong.

Best regards,

gsergom

The numerical labels in the tree correspond, in all cases, to the SGB id

1 Like

@aitor.blancomiguez @gsergom

Thanks very much for your help!

The explanations solved all my questions.

And I’ll check the calcaulte_diversity.R script you suggested, too.

Best regards,

Hi @aitor.blancomiguez,
I have a very similar question as @1112 above. I understand that the tip labels of the latest MetaPhlAn phylogenetic tree (MetaPhlAn/mpa_vJan21_CHOCOPhlAnSGB_202103.nwk at master · biobakery/MetaPhlAn · GitHub) correspond to the SGB id. I would really like to merge the tree with my phyloseq object in R (created from the MetaPhlAn table).

So my question is, does any file exist somewhere which links this SGB id (from the tree) with the species names exactly as they are written in the MetaPhlan table output? I looked at this file (MetaPhlAn/mpa_vJan21_CHOCOPhlAnSGB_202103_SGB2GTDB.tsv at master · biobakery/MetaPhlAn · GitHub) but the naming convention seems different from my MetaPhlan table.

Thank you,
Brandi

Hi @Brandilyn_Peters
Yes, you can have a look at this file: http://cmprod1.cibio.unitn.it/biobakery4/metaphlan_databases/mpa_vJan21_CHOCOPhlAnSGB_202103_species.txt.bz2

1 Like

@aitor.blancomiguez - thank you!!! This works well. There are some species that are in this file that are not in the tree (and vice versa). Is that normal?

Thank you,
Brandi

Hi @Brandilyn_Peters
The eukaryotic species in metaphlan are not present in the tree. Moreover, for a few number of SGBs in which we had only MQ genomes were discarded by the quality control of the phylogeny. However, all SGBs in the tree should be present in the file, could you send me the ids of those cases?

Hi @aitor.blancomiguez ,
Thanks for your response. I just checked it and realized that all the SGBs in the tree are actually in the file (I needed to remove the “_group” string from some of the SGB IDs in the file to have them all match with the tree). So there is no problem. Thanks again!

Brandi

Greetings everyone,

I am reaching out to seek your guidance regarding the import of the Metaphlan 4 tree into a phyloseq object. Unfortunately, I have been facing some difficulties in this process, and I could not decipher how to map it with the supplementary file mentioned.

If someone could generously share their expertise and provide me with the code or strategy used to import the Metaphlan 4 tree into a phyloseq object, it would be greatly appreciated.

Thank you kindly in advance for your help and support.

Best regards,
Vinícius.

1 Like

Hi @Brandilyn_Peters,
Thank you for your inquiries. They helped answer some of the questions I had.

How did you go about relabeling the tips of the MetaPhlAn 4 phylogenetic tree (MetaPhlAn/mpa_vJan21_CHOCOPhlAnSGB_202103.nwk at master · biobakery/MetaPhlAn · GitHub )? I am also trying to merge it with my phyloseq object in R. Did you have to manually find and replace the SGB id with the corresponding species name? I’ve also noticed that some of the SGB id’s have multiple corresponding species, which are separated by a comma in http://cmprod1.cibio.unitn.it/biobakery4/metaphlan_databases/mpa_vJan21_CHOCOPhlAnSGB_202103_species.txt.bz2
I would really appreciate some guidance.

Thank you,
j.ryou

Hi @j.ryou,

This is the code I used to format the tree and merge it with my phyloseq object. Hope it works for you.

tree=read.tree(“H:/mpa_vJan21_CHOCOPhlAnSGB_202103.nwk”)
tree_meta=read.delim(“H:/mpa_vJan21_CHOCOPhlAnSGB_202103_species.txt”,header=F)
tree_meta$V1=gsub(“SGB|_group”,“”,tree_meta$V1) ###added replacement for “_group”
tree_meta=separate(tree_meta,V2,into=c(“V2”,“Other1”,“Other2”,“Other3”,“Other4”,“Other5”,“Other6”),
sep=“,”)
tree_meta=tree_meta[!duplicated(tree_meta[c(‘V2’)]), ]

tree_meta2=subset(tree_meta,V1 %in% tree$tip.label)
tree2=drop.tip(tree,setdiff(tree$tip.label,tree_meta2$V1))
tree_meta2=tree_meta2[match(tree2$tip.label,tree_meta2$V1),]
identical(tree2$tip.label,tree_meta2$V1)
tree2$tip.label=tree_meta2$V2
phy_tree(phy_obj)=tree2

1 Like

Thank you so much, @Brandilyn_Peters!

Hi @j.ryou

How did you handle multiple species matching to the same SGB ID?

I’ve followed everything until this… also really nice to share your code @Brandilyn_Peters!

Thanks!

Hi @em_joyce,

Sorry for the late response. What are you using as your input files?

I have my feature table, with the taxa name (clade_name). And I’m trying to merge it with the SGB-taxa linkage provided by bioBakery (mpa_vJan21_CHOCOPhlAnSGB_202103_species.txt) to have SGBs in my feature table. This way I wouldn’t need to change the tip labels in the tree. But because the .txt file links SGBs to multiple taxa (separated by commas), it isn’t so easy to merge. This relates to your question, because in the code from the response to your original post, it only takes the first taxa name in the list to replace tip labels in the tree. Did that result in lost information? Or was it ok?

Hi,
Is there a file like this also for the other versions of chocophlan? and if so, could you point the location on github?

Cheers and thanks,
Menia

Hi,

I do not remember where I found the *species.txt file, but some of the other CHOCOPHLAN files can be found here: https://github.com/biobakery/MetaPhlAn/tree/master/metaphlan/utils

I modified @Brandilyn_Peters code (thank you, Brandilyn!) to include package names and comments here:

library(magrittr)
# Need to replace adress with local adress with fitting nwk and txt file location
tree <- phyloseq::read_tree("~/yourfilepath/mpa_vJun23_CHOCOPhlAnSGB_202307.nwk")

# Works for both file types of *species.txt and * SGBtoGTDB.tsv, your choice which to use
tree_meta <- read.delim("~/yourfilepath/mpa_vJun23_CHOCOPhlAnSGB_202307_species.txt",header=F)
tree_meta <- read.delim("~/yourfilepath/mpa_vJun23_CHOCOPhlAnSGB_202307_SGB2GTDB.tsv",header=F)

tree_meta$V1 <- gsub("SGB|_group","",tree_meta$V1) # replace "SGB" and "_group" with blank
tree_meta <- tidyr::separate_wider_delim(tree_meta,cols=V2,
                                         names=c("V2","Other1","Other2","Other3","Other4","Other5","Other6"), # Separates the synonym names for SGB's into separate variables named "Other#" 
                                         delim=",",
                                         too_few = "align_start", # When too few synonyms, keep the name in V2, etc. and fill remaining Others with NA's
                                         too_many = "drop")   # when more than 1+6 synonyms, the excess is dropped 

tree_meta <- tree_meta %>%
  dplyr::distinct(V2, .keep_all = TRUE) # keep unique, remove duplicate rows, from obs. 36822 -> 34556

tree_meta2 <- subset(tree_meta,V1 %in% tree$tip.label) # subsets tree_meta2 to contain only tip labels as in my MetaPhlan4 tree.tree = tree.nwk file
tree2 <- ape::drop.tip(tree,setdiff(tree$tip.label,tree_meta2$V1)) # subsets tree2 to contain only tip labels present in tree_meta2
tree_meta2 <- tree_meta2[match(tree2$tip.label,tree_meta2$V1),] # reorder the metadata to match the new tree, tree2
identical(tree2$tip.label,tree_meta2$V1) # check if reordering was successful, should read "TRUE"
tree2$tip.label <- tree_meta2$V2 # Update tip labels to equal the labels in tree_meta
# phyloseq::phy_tree(phy_obj) <- tree2
phy_obj <- phyloseq::phy_tree(tree2) # Assign the modified tree to a phylogenetic object using phyloseq