The bioBakery help forum

MetaPhlAn3 Genome List

Hi All,

I’m attempting to use the Newick-formatted tree provided with MetaPhlAn3 to compute Generalised UniFrac and Phylogenetic Diversity metrics in R, but I’m having trouble removing the assembly ID prefix from the leaf names (the leaf names need to match the row names of the species relative abundance data frame).

For example I would like the leaves to be named

instead of


Does anybody have a copy of this tree available without these IDs?
Or else a file listing the accession numbers of all genomes in the v30 database along with their associated taxonomy that I could use for a batch string replace?

All the best,

It should be easy enough to do a string replacement. I’m not the most competent R programmer, but something like this:

R> taxa <- c('GCA_001856865|k__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|o__Enterobacterales|f__Enterobacteriaceae|g__Metakosakonia|s__Kluyvera_intestini', 'GCA_001856866|k__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|o__Enterobacterales|f__Enterobacteriaceae|g__Metakosakonia|s__Metakosakonia_madeup')
R> gsub('GCA_[0-9]+\\|', '', taxa)
[1] "k__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|o__Enterobacterales|f__Enterobacteriaceae|g__Metakosakonia|s__Kluyvera_intestini"  
[2] "k__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|o__Enterobacterales|f__Enterobacteriaceae|g__Metakosakonia|s__Metakosakonia_madeup"

Thanks @kescobo
Either my brain stopped working or I was overcomplicating matters by attempting to edit the tree with Unix tools before reading it into R.

That’s not totally insensible - a similar thing can be accomplished with sed or perl, you just need to get the regular expression right. Something like the following (untested)

$ sed -E 's/GCA_[0-9]+\|//g' oldfile.txt > newfile.txt
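
A Newick tree usually sits on a single line with the leaf labels embedded throughout, so for the tree file the substitution has to be unanchored and global. Here is a minimal sketch on a made-up two-leaf tree. It assumes the labels follow the `GCA_<digits>|taxonomy` pattern from the R example above; the taxonomy strings are shortened for readability and the branch lengths are invented.

```shell
# Hypothetical two-leaf Newick tree; labels abbreviated, branch lengths made up.
tree='(GCA_001856865|k__Bacteria|s__Kluyvera_intestini:0.1,GCA_001856866|k__Bacteria|s__Metakosakonia_madeup:0.2);'

# Strip every "GCA_<digits>|" prefix. With -E, "\|" matches a literal pipe,
# and the g flag replaces all occurrences on the line, not just the first.
cleaned=$(printf '%s\n' "$tree" | sed -E 's/GCA_[0-9]+\|//g')

printf '%s\n' "$cleaned"
```

If some leaves in the real tree use a different accession prefix, the pattern would need adjusting, so it is worth checking the result for leftover prefixes before reading the tree into R.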