MetaPhlAn3 Genome List

cazzlewazzle89 · September 3, 2020, 4:26pm

Hi All,

I’m attempting to use the Newick-formatted tree provided with MetaPhlAn3 to compute Generalised UniFrac and Phylogenetic Diversity metrics in R, but I’m having trouble removing the assembly ID prefix from the leaf names (the leaf names need to match the row names of the species relative abundance data frame).

instead of

Does anybody have a copy of this tree available without these IDs?
Or else a file listing the accession numbers of all genomes in the v30 database along with their associated taxonomy that I could use for a batch string replace?

All the best,
Calum

kescobo · September 3, 2020, 9:04pm

Should be easy enough to do a string replacement. I’m not the most competent R programmer, but something like

R> taxa <- c('GCA_001856865|k__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|o__Enterobacterales|f__Enterobacteriaceae|g__Metakosakonia|s__Kluyvera_intestini', 'GCA_001856866|k__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|o__Enterobacterales|f__Enterobacteriaceae|g__Metakosakonia|s__Metakosakonia_madeup')
R> gsub('GCA_[0-9]+\\|', '', taxa)
[1] "k__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|o__Enterobacterales|f__Enterobacteriaceae|g__Metakosakonia|s__Kluyvera_intestini"  
[2] "k__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|o__Enterobacterales|f__Enterobacteriaceae|g__Metakosakonia|s__Metakosakonia_madeup"

cazzlewazzle89 · September 15, 2020, 7:25am

Thanks @kescobo
Either my brain stopped working or I was overcomplicating matters by attempting to edit the tree with Unix tools before reading it into R.

kescobo · September 15, 2020, 7:17pm

That’s not totally insensible - a similar thing can be accomplished with sed or perl, you just need to get the regular expression right. Something like the following (untested)

$ sed -E 's/^GCA_[0-9]+\|//' oldfile.txt > newfile.txt
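For the Newick file itself, note that the `^` anchor only matches a prefix at the very start of a line, while a tree usually sits on a single line with leaf names spread throughout. A minimal sketch without the anchor and with the `g` flag, assuming leaf names look like the strings above (the two-leaf tree below is made up for illustration):

```shell
# Made-up two-leaf Newick string; real leaf names carry the full taxonomy.
tree='(GCA_001856865|k__Bacteria|s__Kluyvera_intestini:0.1,GCA_000006845|k__Bacteria|s__Neisseria_gonorrhoeae:0.2);'

# No ^ anchor, and the g flag so every prefix on the line is removed.
stripped=$(printf '%s\n' "$tree" | sed -E 's/GCA_[0-9]+\|//g')
printf '%s\n' "$stripped"
```

To apply the same expression to a file, replace the `printf` pipeline with the input file name and redirect to a new file, as in the command above.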

bigdoyle · October 27, 2020, 8:16pm

First, the tree file is a godsend. THANK YOU!

Second, it’s not just simple character substitution to do it in the context of the Newick file, because there are numbers in many taxonomies.

I reformatted the tree so that the leaves are only the species level taxonomy, removing the rest of the string. If you like you can download it here until I clean up my dropbox. Feel free to send me gifts.

If you have the full taxonomy you can easily grab just the species in bash, e.g.
echo "$fulltax" | rev | cut -d "|" -f 1 | rev > species
or
species=$(echo "$fulltax" | rev | cut -d "|" -f 1 | rev)
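A sketch of the same extraction using only shell parameter expansion, which avoids spawning `rev` and `cut`. The value of `fulltax` is an example string borrowed from earlier in the thread:

```shell
fulltax='k__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|o__Enterobacterales|f__Enterobacteriaceae|g__Metakosakonia|s__Kluyvera_intestini'

# ${var##*|} deletes the longest leading match of "*|",
# which leaves only the text after the last pipe.
species=${fulltax##*|}
printf '%s\n' "$species"
```

This assumes the species is always the last pipe-separated field, as in the strings shown in this thread.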

I made a separate two-column file with all the full taxonomic strings and the replacement species text, and ran a while read loop in bash:

while read -r fulltax species; do
    sed -i "s/${fulltax}/${species}/g" nwk_file
done < 2column_file
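If the two-column file is not at hand, a single `sed` pass might do the same job. This is only a sketch: it assumes every leaf name starts with a `GCA_` accession, ends in an `s__` field, and contains none of the Newick delimiters `,():`. The sample tree is invented for illustration.

```shell
# Invented two-leaf tree in the same label style as the MetaPhlAn3 file.
tree='(GCA_001856865|k__Bacteria|g__Metakosakonia|s__Kluyvera_intestini:0.1,GCA_000006845|k__Bacteria|g__Neisseria|s__Neisseria_gonorrhoeae:0.2);'

# Group 1 swallows each "rank|" field after the accession;
# group 2 captures the final s__ field, which is all that is kept.
species_tree=$(printf '%s\n' "$tree" \
  | sed -E 's/GCA_[0-9]+\|([^,():|]*\|)*(s__[^,():|]+)/\2/g')
printf '%s\n' "$species_tree"
```

Branch lengths and tree structure are untouched because the pattern never crosses a `,`, `(`, `)` or `:`.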

LuZhang · February 7, 2021, 4:38pm

Hi, may I ask where I can find the Newick-formatted tree provided with MetaPhlAn3? I also checked this issue (metaphlan3 phylogenetic tree · Issue #92 · biobakery/MetaPhlAn · GitHub), but couldn’t find the tree mentioned. I would appreciate any suggestions.

Is this the tree file that I can use for tree visualization? MetaPhlAn/metaphlan/utils at master · biobakery/MetaPhlAn · GitHub

fbeghini · February 8, 2021, 8:56am

You can find it in the MetaPhlAn repository (https://github.com/biobakery/MetaPhlAn/tree/master/metaphlan/utils), the filename is mpa_v30_CHOCOPhlAn_201901_species_tree.nwk

cazzlewazzle89 · March 14, 2021, 12:55pm

Hi Francesco,

Is there a newer version of the tree available? I have tried using the version linked above to generate UniFrac distances in R, but some species from my MetaPhlAn3 outputs are not present in the tree.

As far as I can tell we are using the same version of the MPA database “mpa_v30_CHOCOPhlAn_201901”.
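To pin down exactly which species are affected, something like the following sketch could list them. It does not explain why they are missing. The file names and contents are hypothetical stand-ins, created here only so the example runs on its own; in practice the species list would come from the MetaPhlAn3 output.

```shell
# Hypothetical inputs: a tiny tree and a list of profiled species.
workdir=$(mktemp -d)
printf '%s\n' '(GCA_1|k__Bacteria|s__Species_A:0.1,GCA_2|k__Bacteria|s__Species_B:0.2);' > "$workdir/tree.nwk"
printf '%s\n' 's__Species_A' 's__Species_C' > "$workdir/profile_species.txt"

# Pull every s__ label out of the tree, one per line, sorted for comm.
grep -o 's__[^,():|]*' "$workdir/tree.nwk" | sort -u > "$workdir/tree_species.txt"
sort -u "$workdir/profile_species.txt" > "$workdir/profile_sorted.txt"

# comm -23 keeps lines found only in the first file:
# species that were profiled but have no leaf in the tree.
missing=$(comm -23 "$workdir/profile_sorted.txt" "$workdir/tree_species.txt")
printf '%s\n' "$missing"
rm -r "$workdir"
```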

All the best,
Calum

LuZhang · March 31, 2021, 1:25pm

I have the same problem (some species are not present in the tree). Did you find a solution? @cazzlewazzle89

LuZhang · March 31, 2021, 1:31pm

Hi, I tried the tree you mentioned in the link, mpa_v30_CHOCOPhlAn_201901. Thank you! @fbeghini

However, I have a small question about the phylum.

When I looked at the phylum level, I noticed that this species sits nearer to species in the phylum Firmicutes. Should it be in the Bacteroidetes cluster instead?

I would appreciate any advice. Thanks in advance!

Best Regards,
Lu

D11 · March 14, 2022, 3:53pm

Thanks for your response @kescobo .

Nonetheless, in case another metagenomics newbie like me reads this thread, I would like to add the following.

To remove the GCA prefix from each tip label of the MetaPhlAn tree, take the following into account:

library(phyloseq)
# Open the MetaPhlAn tree (in Newick format) with the function read_tree()

tree <- read_tree("mpa_v30_CHOCOPhlAn_201901_species_tree.nwk")

# The vector to change is tree$tip.label, but be careful not to change class(tree).

Run the next command to meet @cazzlewazzle89’s requirements, so that a phyloseq object including phylogenetic information can be constructed:

tree$tip.label <- gsub('GCA_[0-9]+\\|', '', tree$tip.label)
