The bioBakery help forum

MetaPhlAn3 Genome List

Hi All,

I’m attempting to use the Newick-formatted tree provided with MetaPhlAn3 to compute Generalised UniFrac and Phylogenetic Diversity metrics in R, but I’m having trouble removing the assembly ID prefix from the leaf names (the leaf names need to match the row names of the species relative abundance data frame).

For example I would like the leaves to be named
k__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|o__Enterobacterales|f__Enterobacteriaceae|g__Metakosakonia|s__Kluyvera_intestini

instead of

GCA_001856865|k__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|o__Enterobacterales|f__Enterobacteriaceae|g__Metakosakonia|s__Kluyvera_intestini

Does anybody have a copy of this tree available without these IDs?
Or else a file listing the accession numbers of all genomes in the v30 database along with their associated taxonomy, which I could use for a batch string replacement?

All the best,
Calum

Should be easy enough to do a string replacement. I’m not the most competent R programmer, but something like

R> taxa <- c('GCA_001856865|k__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|o__Enterobacterales|f__Enterobacteriaceae|g__Metakosakonia|s__Kluyvera_intestini', 'GCA_001856866|k__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|o__Enterobacterales|f__Enterobacteriaceae|g__Metakosakonia|s__Metakosakonia_madeup')
R> gsub('GCA_[0-9]+\\|', '', taxa)
[1] "k__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|o__Enterobacterales|f__Enterobacteriaceae|g__Metakosakonia|s__Kluyvera_intestini"  
[2] "k__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|o__Enterobacterales|f__Enterobacteriaceae|g__Metakosakonia|s__Metakosakonia_madeup"

Thanks @kescobo
Either my brain stopped working or I was overcomplicating matters by attempting to edit the tree with Unix tools before reading it into R.

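That’s not totally insensible. A similar thing can be accomplished with sed or perl; you just need to get the regular expression right. Something like the following (untested; the pattern is left unanchored and global because the leaf names in a Newick file sit mid-line rather than at the start of a line)

$ sed -E 's/GCA_[0-9]+\|//g' oldfile.txt > newfile.txt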
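A quick way to check a sed expression like this before touching the real tree is to run it on a toy Newick string. The sketch below assumes GNU or BSD sed with -E; the two leaves and their accession numbers are made up for illustration and are not taken from the MetaPhlAn tree.

```shell
# Toy Newick string with two invented leaves (not from the real tree file).
tree='(GCA_000000001|k__Bacteria|s__Species_one:0.1,GCA_000000002|k__Bacteria|s__Species_two:0.2);'

# Strip every "GCA_<digits>|" prefix. The pattern is unanchored because the
# leaf names sit in the middle of the line, and g handles every leaf on it.
stripped=$(printf '%s\n' "$tree" | sed -E 's/GCA_[0-9]+\|//g')

printf '%s\n' "$stripped"
# prints: (k__Bacteria|s__Species_one:0.1,k__Bacteria|s__Species_two:0.2);
```

The branch lengths after the colons are untouched because the pattern only matches digits that directly follow "GCA_".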

First, the tree file is a godsend. THANK YOU!

Second, it’s not just a simple character substitution when you do it in the context of the Newick file, because there are numbers in many of the taxonomy strings.

I reformatted the tree so that the leaves are only the species level taxonomy, removing the rest of the string. If you like you can download it here until I clean up my dropbox. Feel free to send me gifts.

If you have the full taxonomy you can easily grab just the species in bash, e.g.
echo "$fulltax" | rev | cut -d "|" -f 1 | rev > species
or
species=$(echo "$fulltax" | rev | cut -d "|" -f 1 | rev)
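As an alternative sketch, shell parameter expansion can take the last "|"-separated field without spawning rev or cut. The variable names follow the snippet above; the taxonomy string is the example from the top of the thread.

```shell
fulltax='k__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|o__Enterobacterales|f__Enterobacteriaceae|g__Metakosakonia|s__Kluyvera_intestini'

# ${var##*|} removes the longest prefix ending in "|", leaving the last field.
species=${fulltax##*|}

printf '%s\n' "$species"
# prints: s__Kluyvera_intestini
```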

I made a separate 2 column file with all the full taxonomic strings and the replacement species text, and ran a while read loop in bash

# Column 1 is the full taxonomy string, column 2 the species-only replacement.
while read -r fulltax species; do
  sed -i "s/${fulltax}/${species}/g" nwk_file
done < 2column_file
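One possible way to build such a two-column file is with awk, assuming you already have a list with one full taxonomy string per line. The file names fulltax_list.txt and two_column.tsv are hypothetical, and the second taxonomy string is only an illustrative example.

```shell
# Work in a scratch directory so nothing real is overwritten.
tmpdir=$(mktemp -d)

# Hypothetical input: one full taxonomy string per line.
printf '%s\n' \
  'k__Bacteria|g__Metakosakonia|s__Kluyvera_intestini' \
  'k__Bacteria|g__Escherichia|s__Escherichia_coli' > "$tmpdir/fulltax_list.txt"

# Column 1: the full string; column 2: the last "|"-separated field (species).
awk -F'|' -v OFS='\t' '{ print $0, $NF }' "$tmpdir/fulltax_list.txt" > "$tmpdir/two_column.tsv"

cat "$tmpdir/two_column.tsv"
```

The resulting file is tab-separated, which the default whitespace splitting of read handles as long as the taxonomy strings themselves contain no spaces.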

Hi, may I ask where I can find this Newick-formatted tree provided with MetaPhlAn3? I also checked this GitHub issue (metaphlan3 phylogenetic tree · Issue #92 · biobakery/MetaPhlAn · GitHub), but couldn’t find the tree mentioned. I would appreciate any suggestions.

Is this the tree file that I can use for tree visualization? MetaPhlAn/metaphlan/utils at master · biobakery/MetaPhlAn · GitHub

You can find it in the MetaPhlAn repository (https://github.com/biobakery/MetaPhlAn/tree/master/metaphlan/utils); the filename is mpa_v30_CHOCOPhlAn_201901_species_tree.nwk
