I’m attempting to use the Newick-formatted tree provided with MetaPhlAn3 to compute Generalised UniFrac and Phylogenetic Diversity metrics in R, but I’m having trouble removing the assembly ID prefix from the leaf names (the leaf names need to match the row names of the species relative abundance data frame).
For example I would like the leaves to be named k__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|o__Enterobacterales|f__Enterobacteriaceae|g__Metakosakonia|s__Kluyvera_intestini
Does anybody have a copy of this tree available without these IDs?
Or else a file listing the accession numbers of all genomes in the v30 database along with their associated taxonomy that I could use for a batch string replace?
That’s not totally insensible - a similar thing can be accomplished with sed or perl; you just need to get the regular expression right. Something like the following (untested):
$ sed -E 's/^GCA_[0-9]+\|//' oldfile.txt > newfile.txt
Second, doing it in the context of the Newick file is not just simple character substitution, because many taxonomies contain numbers.
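For the Newick file itself, where the leaves are not at the start of a line, an unanchored, global variant of the same expression might be closer to what is needed. This is only a minimal sketch: the toy tree, its leaf names and the helper name strip_gca_prefix are made up for illustration, and it assumes every prefix has the form GCA_<digits>| (check the real tree for other accession styles before relying on it).

```shell
# Toy two-leaf tree; leaf names and branch lengths are invented for illustration.
toy_tree='(GCA_000000001|k__Bacteria|s__Species_one:0.1,GCA_000000002|k__Bacteria|s__Species_two:0.2);'

# Hypothetical helper: remove every "GCA_<digits>|" prefix found on stdin.
# No ^ anchor and a trailing g flag, so all leaves on the line are handled.
strip_gca_prefix() {
  sed -E 's/GCA_[0-9]+\|//g'
}

stripped_tree=$(printf '%s\n' "$toy_tree" | strip_gca_prefix)
printf '%s\n' "$stripped_tree"
```

Branch lengths such as 0.1 are left alone because the pattern requires the literal GCA_ prefix and a trailing pipe.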
I reformatted the tree so that the leaves are only the species level taxonomy, removing the rest of the string. If you like you can download it here until I clean up my dropbox. Feel free to send me gifts.
If you have the full taxonomy you can easily grab just the species in bash, e.g.
echo "$fulltax" | rev | cut -d "|" -f 1 | rev > species
or species=$(echo "$fulltax" | rev | cut -d "|" -f 1 | rev)
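The same thing can also be done without rev and cut by using shell parameter expansion. A small sketch, using the example taxonomy string from the first post; it assumes the species is always the last pipe-separated field.

```shell
# Full taxonomy string taken from the example earlier in the thread.
fulltax='k__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|o__Enterobacterales|f__Enterobacteriaceae|g__Metakosakonia|s__Kluyvera_intestini'

# "##*|" strips the longest leading match of "anything followed by |",
# leaving only the text after the last pipe.
species=${fulltax##*|}
echo "$species"
```

This avoids spawning three extra processes per string, which may matter when looping over many leaves.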
I made a separate 2-column file with all the full taxonomic strings and the replacement species text, and ran a while read loop in bash:
while read -r fulltax species; do
  sed -i "s/${fulltax}/${species}/g" nwk_file
done < 2column_file
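One possible way to build that 2-column file is with awk, assuming you already have a plain list with one full taxonomy string per line. This is just a sketch: the temporary file stands in for that list, and the two strings are the examples quoted in this thread.

```shell
# Hypothetical input list: one full taxonomy string per line.
leaves_file=$(mktemp)
printf '%s\n' \
  'k__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|o__Enterobacterales|f__Enterobacteriaceae|g__Metakosakonia|s__Kluyvera_intestini' \
  'k__Bacteria|p__Bacteroidetes|c__Bacteroidia|o__Bacteroidales|f__Bacteroidaceae|g__Bacteroides|s__Bacteroides_galacturonicus' \
  > "$leaves_file"

# Split each line on "|" and print the whole string plus its last field,
# separated by a tab (column 1 = full taxonomy, column 2 = species).
two_column=$(awk -F'|' -v OFS='\t' '{ print $0, $NF }' "$leaves_file")
printf '%s\n' "$two_column"

rm -f "$leaves_file"
```

Redirect the awk output to a file to feed it to a while read loop.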
Is there a newer version of the tree available? I have tried using the version linked above to generate UniFrac distances in R, but some species from my MetaPhlAn3 outputs are not present in the tree.
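To see exactly which species are affected, something along these lines might help. It is a rough sketch on invented data: the toy tree and the two profile species are hypothetical, and the pattern assumes leaf names contain only letters, digits and underscores after s__, which may not hold for every name in the real tree.

```shell
# Toy tree with species-level leaf names; all names here are made up.
toy_tree='(s__Species_one:0.1,(s__Species_two:0.2,s__Species_three:0.3):0.05);'
tmpdir=$(mktemp -d)

# Pull the leaf names out of the tree, one per line, sorted for comm.
printf '%s\n' "$toy_tree" | grep -oE 's__[A-Za-z0-9_]+' \
  | LC_ALL=C sort -u > "$tmpdir/tree_leaves.txt"

# Hypothetical species names from a profile, sorted the same way.
printf '%s\n' 's__Species_two' 's__Species_four' \
  | LC_ALL=C sort -u > "$tmpdir/profile_species.txt"

# comm -23 keeps lines that appear only in the first file,
# i.e. profile species that have no leaf in the tree.
missing=$(LC_ALL=C comm -23 "$tmpdir/profile_species.txt" "$tmpdir/tree_leaves.txt")
printf '%s\n' "$missing"

rm -rf "$tmpdir"
```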
Hi, I tried the tree you mentioned in the link, mpa_v30_CHOCOPhlAn_201901. Thank you, @fbeghini!
However, I have a small question about phylum-level placement.
For example, I am interested in the species GCA_003096855|k__Bacteria|p__Bacteroidetes|c__Bacteroidia|o__Bacteroidales|f__Bacteroidaceae|g__Bacteroides|s__Bacteroides_galacturonicus
But when I looked at the phylum level, I noticed that this species sits nearer to species in the phylum Firmicutes. Shouldn't it cluster with the phylum Bacteroidetes?
I would appreciate any advice. Thanks in advance!