To this end, first, I opened that phylogenetic tree in R using the ape::read.tree() function to prune tips not in my metaphlan profile taxonomy.
And then I found that the tip labels of the tree are written in numeric values, not taxonomic strings such as 'k__Bacteria|p__Fusobacteria|c__CFGB15|o__OFGB15|f__FGB15|g__GGB18|s__GGB18_SGB19|t__SGB19'.
Codes:
tree <- ape::read.tree("mpa_vJan21_CHOCOPhlAnSGB_202103.nwk")
tree$tip.label
Is it originated from the names of SGBs? (It seems like numbers are similar with names of SGBs in mpa_vJan21_CHOCOPhlAnSGB_202103.pkl file, but I can’t guarantee it.).
For the taxonomy that doesn’t have an NCBI taxID I suppose that the tree name corresponds to the number after the SGB___ in the clade_name field.
These are the conclusions I reached after reading the calculate_diversity.R code and doing some comprobations on R myself, but I could be wrong.
@aitor.blancomiguez - thank you!!! This works well. There are some species that are in this file that are not in the tree (and vice versa). Is that normal?
Hi @Brandilyn_Peters
The eukaryotic species in metaphlan are not present in the tree. Moreover, for a few number of SGBs in which we had only MQ genomes were discarded by the quality control of the phylogeny. However, all SGBs in the tree should be present in the file, could you send me the ids of those cases?
Hi @aitor.blancomiguez ,
Thanks for your response. I just checked it and realized that all the SGBs in the tree are actually in the file (I needed to remove the “_group” string from some of the SGB IDs in the file to have them all match with the tree). So there is no problem. Thanks again!
I am reaching out to seek your guidance regarding the import of the Metaphlan 4 tree into a phyloseq object. Unfortunately, I have been facing some difficulties in this process, and I could not decipher how to map it with the supplementary file mentioned.
If someone could generously share their expertise and provide me with the code or strategy used to import the Metaphlan 4 tree into a phyloseq object, it would be greatly appreciated.
Thank you kindly in advance for your help and support.
I have my feature table, with the taxa name (clade_name). And I’m trying to merge it with the SGB-taxa linkage provided by bioBakery (mpa_vJan21_CHOCOPhlAnSGB_202103_species.txt) to have SGBs in my feature table. This way I wouldn’t need to change the tip labels in the tree. But because the .txt file links SGBs to multiple taxa (separated by commas), it isn’t so easy to merge. This relates to your question, because in the code from the response to your original post, it only takes the first taxa name in the list to replace tip labels in the tree. Did that result in lost information? Or was it ok?
I modified @Brandilyn_Peters code (thank you, Brandilyn!) to include package names and comments here:
library(magrittr)
# Need to replace adress with local adress with fitting nwk and txt file location
tree <- phyloseq::read_tree("~/yourfilepath/mpa_vJun23_CHOCOPhlAnSGB_202307.nwk")
# Works for both file types of *species.txt and * SGBtoGTDB.tsv, your choice which to use
tree_meta <- read.delim("~/yourfilepath/mpa_vJun23_CHOCOPhlAnSGB_202307_species.txt",header=F)
tree_meta <- read.delim("~/yourfilepath/mpa_vJun23_CHOCOPhlAnSGB_202307_SGB2GTDB.tsv",header=F)
tree_meta$V1 <- gsub("SGB|_group","",tree_meta$V1) # replace "SGB" and "_group" with blank
tree_meta <- tidyr::separate_wider_delim(tree_meta,cols=V2,
names=c("V2","Other1","Other2","Other3","Other4","Other5","Other6"), # Separates the synonym names for SGB's into separate variables named "Other#"
delim=",",
too_few = "align_start", # When too few synonyms, keep the name in V2, etc. and fill remaining Others with NA's
too_many = "drop") # when more than 1+6 synonyms, the excess is dropped
tree_meta <- tree_meta %>%
dplyr::distinct(V2, .keep_all = TRUE) # keep unique, remove duplicate rows, from obs. 36822 -> 34556
tree_meta2 <- subset(tree_meta,V1 %in% tree$tip.label) # subsets tree_meta2 to contain only tip labels as in my MetaPhlan4 tree.tree = tree.nwk file
tree2 <- ape::drop.tip(tree,setdiff(tree$tip.label,tree_meta2$V1)) # subsets tree2 to contain only tip labels present in tree_meta2
tree_meta2 <- tree_meta2[match(tree2$tip.label,tree_meta2$V1),] # reorder the metadata to match the new tree, tree2
identical(tree2$tip.label,tree_meta2$V1) # check if reordering was successful, should read "TRUE"
tree2$tip.label <- tree_meta2$V2 # Update tip labels to equal the labels in tree_meta
# phyloseq::phy_tree(phy_obj) <- tree2
phy_obj <- phyloseq::phy_tree(tree2) # Assign the modified tree to a phylogenetic object using phyloseq