MetaPhlAn3 Genome List

Hi All,

I’m attempting to use the Newick-formatted tree provided with MetaPhlAn3 to compute Generalised UniFrac and Phylogenetic Diversity metrics in R, but I’m having trouble removing the assembly ID prefix from the leaf names (the leaf names need to match the row names of the species relative abundance data frame).

For example I would like the leaves to be named
k__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|o__Enterobacterales|f__Enterobacteriaceae|g__Metakosakonia|s__Kluyvera_intestini

instead of

GCA_001856865|k__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|o__Enterobacterales|f__Enterobacteriaceae|g__Metakosakonia|s__Kluyvera_intestini

Does anybody have a copy of this tree available without these IDs?
Or else a file listing the accession numbers of all genomes in the v30 database along with their associated taxonomy that I could use for a batch string replace?

All the best,
Calum

Should be easy enough to do a string replacement. I’m not the most competent R programmer, but something like

R> taxa <- c('GCA_001856865|k__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|o__Enterobacterales|f__Enterobacteriaceae|g__Metakosakonia|s__Kluyvera_intestini', 'GCA_001856866|k__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|o__Enterobacterales|f__Enterobacteriaceae|g__Metakosakonia|s__Metakosakonia_madeup')
R> gsub('GCA_[0-9]+\\|', '', taxa)
[1] "k__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|o__Enterobacterales|f__Enterobacteriaceae|g__Metakosakonia|s__Kluyvera_intestini"  
[2] "k__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|o__Enterobacterales|f__Enterobacteriaceae|g__Metakosakonia|s__Metakosakonia_madeup"

Thanks @kescobo
Either my brain stopped working or I was overcomplicating matters by attempting to edit the tree with Unix tools before reading it into R.

That’s not totally insensible - a similar thing can be accomplished with sed or perl, you just need to get the regular expression right. Something like the following (untested)

$ sed -E 's/GCA_[0-9]+\|//g' oldfile.txt > newfile.txt
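A Newick tree is usually stored on a single line, so the prefix has to be matched anywhere on the line and more than once. Here is a minimal sketch of that idea on a made-up two-leaf tree (the tree string and the second accession are invented for illustration, not taken from the real file):

```shell
# Toy Newick tree with two leaves (hypothetical example data).
tree='(GCA_001856865|k__Bacteria|s__Kluyvera_intestini:0.1,GCA_000000001|k__Bacteria|s__Made_up:0.2);'

# No ^ anchor, plus the g flag, so every leaf on the line is rewritten.
# With -E, \| matches a literal pipe character.
cleaned=$(printf '%s\n' "$tree" | sed -E 's/GCA_[0-9]+\|//g')
printf '%s\n' "$cleaned"
```

This should print the same tree with both accession prefixes gone and the branch lengths untouched.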

First, the tree file is a godsend. THANK YOU!

Second, it’s not just a simple character substitution in the context of the Newick file, because there are numbers in many taxonomies.

I reformatted the tree so that the leaves are only the species level taxonomy, removing the rest of the string. If you like you can download it here until I clean up my dropbox. Feel free to send me gifts.

If you have the full taxonomy, you can easily grab just the species in bash, e.g.
echo "$fulltax" | rev | cut -d "|" -f 1 | rev > species
or
species=$(echo "$fulltax" | rev | cut -d "|" -f 1 | rev)
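The same extraction can also be done without any external commands, using shell parameter expansion. A small sketch, assuming the taxonomy string is held in a variable called fulltax as above:

```shell
fulltax='k__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|o__Enterobacterales|f__Enterobacteriaceae|g__Metakosakonia|s__Kluyvera_intestini'

# ${var##*|} strips the longest prefix ending in "|", leaving only the last field.
species=${fulltax##*|}
printf '%s\n' "$species"
```

This avoids the double rev and may be a little faster inside a loop over many taxa.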

I made a separate 2 column file with all the full taxonomic strings and the replacement species text, and ran a while read loop in bash

# Note: fulltax is used as a regular expression, so characters such as [ ] need escaping
while read -r fulltax species; do
sed -i "s/${fulltax}/${species}/g" nwk_file
done < 2column_file
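For anyone wondering how such a two-column file could be built, here is one possible sketch with awk. The input file name fulltax_list.txt and its two entries are made up for illustration. The real list would hold one full taxonomy string per line.

```shell
workdir=$(mktemp -d)

# Hypothetical input: one full taxonomy string per line.
printf '%s\n' \
  'k__Bacteria|p__Firmicutes|g__Granulicatella|s__Granulicatella_elegans' \
  'k__Bacteria|p__Proteobacteria|g__Sutterella|s__Sutterella_parvirubra' > "$workdir/fulltax_list.txt"

# Split on "|" and print the whole string plus its last field, tab-separated.
awk -F'|' -v OFS='\t' '{ print $0, $NF }' "$workdir/fulltax_list.txt" > "$workdir/2column_file"
cat "$workdir/2column_file"
```

Because the taxonomy strings contain no whitespace, a plain read with two variable names splits each line correctly at the tab.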

Hi, may I ask where I can find the Newick-formatted tree provided with MetaPhlAn3? I also checked this issue (metaphlan3 phylogenetic tree · Issue #92 · biobakery/MetaPhlAn · GitHub), but couldn’t find the tree mentioned there. I would appreciate any suggestions.

Is this the tree file that I can use for tree visualization? MetaPhlAn/metaphlan/utils at master · biobakery/MetaPhlAn · GitHub

You can find it in the MetaPhlAn repository (https://github.com/biobakery/MetaPhlAn/tree/master/metaphlan/utils). The filename is mpa_v30_CHOCOPhlAn_201901_species_tree.nwk


Hi Francesco,

Is there a newer version of the tree available? I have tried using the version linked above to generate UniFrac distances in R, but some species from my MetaPhlAn3 outputs are not present in the tree.

Namely:
k__Bacteria|p__Actinobacteria|c__Coriobacteriia|o__Coriobacteriales|f__Coriobacteriaceae|g__Collinsella|s__Collinsella_stercoris
k__Bacteria|p__Actinobacteria|c__Coriobacteriia|o__Coriobacteriales|f__Coriobacteriaceae|g__Enorma|s__[Collinsella]_massiliensis
k__Bacteria|p__Firmicutes|c__Bacilli|o__Lactobacillales|f__Carnobacteriaceae|g__Granulicatella|s__Granulicatella_elegans
k__Bacteria|p__Firmicutes|c__Clostridia|o__Clostridiales|f__Ruminococcaceae|g__Ruminococcus|s__Ruminococcus_champanellensis
k__Bacteria|p__Firmicutes|c__Erysipelotrichia|o__Erysipelotrichales|f__Erysipelotrichaceae|g__Bulleidia|s__Bulleidia_extructa
k__Bacteria|p__Proteobacteria|c__Betaproteobacteria|o__Burkholderiales|f__Sutterellaceae|g__Sutterella|s__Sutterella_parvirubra
k__Bacteria|p__Synergistetes|c__Synergistia|o__Synergistales|f__Synergistaceae|g__Cloacibacillus|s__Cloacibacillus_evryensis

As far as I can tell we are using the same version of the MPA database “mpa_v30_CHOCOPhlAn_201901”.
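In case it helps anyone reproduce the check, this is roughly how the missing species could be listed from the command line. It is only a sketch on toy data: the tree, the species names, and the file names are all made up, and it assumes the tip labels have already had their GCA prefix removed so they can be compared as exact strings.

```shell
workdir=$(mktemp -d)

# Toy tree and species list (invented data, prefixes already removed).
printf '%s\n' '((s__A_one:0.1,s__B_two:0.2):0.3,s__C_three:0.4);' > "$workdir/tree.nwk"
printf '%s\n' 's__A_one' 's__D_four' 's__C_three' > "$workdir/species.txt"

# Leaf labels are the tokens that follow "(" or "," and run up to the first ":".
tr '(,' '\n\n' < "$workdir/tree.nwk" \
  | sed -n 's/^\([^:();]\{1,\}\):.*/\1/p' > "$workdir/leaves.txt"

# Keep the species that do not match any leaf exactly (-F fixed string, -x whole line).
missing=$(grep -vxFf "$workdir/leaves.txt" "$workdir/species.txt")
printf '%s\n' "$missing"
```

The fixed-string matching matters here, because names such as s__[Collinsella]_massiliensis contain characters that would otherwise be read as a regular expression.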

All the best,
Calum


I have the same problem as you (some species are not present in the tree). Did you find a solution? @cazzlewazzle89

Hi! I tried the tree you mentioned in the link, mpa_v30_CHOCOPhlAn_201901. Thank you, @fbeghini!

However, I have a small question related to the phylum.

For example, I am interested in this species:
GCA_003096855|k__Bacteria|p__Bacteroidetes|c__Bacteroidia|o__Bacteroidales|f__Bacteroidaceae|g__Bacteroides|s__Bacteroides_galacturonicus

But when I looked at the phylum level, I noticed that this species sits closer to species in the phylum Firmicutes. Shouldn't it cluster with the phylum Bacteroidetes?

I would appreciate any advice. Thanks in advance!

Best Regards,
Lu

Thanks for your response @kescobo .

Nonetheless, in case another newcomer to metagenomics like me reads this thread, I would like to add the following.

To remove the GCA prefix from each tip label of the MetaPhlAn tree, proceed as follows:

library(phyloseq)
# Read the MetaPhlAn tree (in Newick format) with the function read_tree()

tree <- read_tree("mpa_v30_CHOCOPhlAn_201901_species_tree.nwk")

# The vector to change is tree$tip.label; modify it in place so that class(tree) stays unchanged.

Run the next command to meet @cazzlewazzle89's requirement, so that a phyloseq object including the phylogenetic information can be constructed:

tree$tip.label <- gsub('GCA_[0-9]+\\|', '', tree$tip.label)