Prokaryotic tree reconstruction pipeline issue

Dear Phylophlan team,

I was trying to reconstruct the prokaryotic tree by following the GitHub tutorial. I started by downloading one genome per species with the command phylophlan_get_reference -g all -o $genomes -n 1, and ended up with 17509 genomes. I realized that these are DNA sequences, whereas the marker gene database is composed of amino acid sequences, which means I would have to run Prodigal on these genomes to predict their amino acid sequences. This would take a really long time, and might not be entirely necessary. Is there a better way to complete this task (i.e. one that avoids the translation step)? If the step is unavoidable, can I get some idea of how long it would take, including the rest of the pipeline? Thank you!

Best,
Rui

Dear Rui,

Thanks for using PhyloPhlAn. You don’t need to predict proteins from your input genomes. We designed PhyloPhlAn to work with both genomes and proteomes as input when using a database of proteins. You just need to make sure your config file has both the map_dna and map_aa sections (the default configs PhyloPhlAn creates during installation include them), and that they use software that can handle a protein database and translated search, like diamond.

You can find more info in the wiki (look at the Input Files and Configuration File sections). The Prokaryotes Tree of life reconstruction tutorial also gives the command to generate a configuration file.

Let me know if something is not clear.

Thanks,
Francesco

Thank you @f.asnicar! This cleared up all my confusion!

Just a follow-up question. I imagine the final tree would have branch names that are the same as the downloaded assembly files. For example, GCA_002407065.fna.gz. If I were to replace the assembly ID with their taxonomy information, how can I do that? Should I just download the taxonomy information from the NCBI assembly database and replace the names in the tree file? Or are there better ways to go about it? Much appreciated!

Yes, leaves will have the names of the input genomes. You don’t need to look up their taxonomic labels in NCBI, though, because you already have that info where you ran phylophlan_get_reference. There you’ll find a file named taxa2genomes_cpa201901_up201901.txt.bz2, which PhyloPhlAn downloaded and used to fetch the genomes. You can get the taxonomy of the downloaded genomes from that file.
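
As a rough sketch of how that file could be turned into a lookup table: the snippet below assumes the file is tab-separated, that the taxonomy string sits in the second column, and that the assembly accessions (GCA_/GCF_ followed by digits) appear somewhere in the other columns. Those are assumptions about the layout, not something stated in this thread, so check a few lines of the file first and adjust tax_col if needed. The function name is made up for illustration.

```python
import bz2
import re

# Matches assembly accessions such as GCA_002407065 (any ".1" version is left out).
ACCESSION = re.compile(r"GC[AF]_\d+")


def load_accession_to_taxonomy(path, tax_col=1, sep="\t"):
    """Build a dict {assembly accession: taxonomy string} from a bz2 text file.

    ASSUMPTIONS (verify against your file): columns are separated by `sep`,
    the taxonomy string is in column `tax_col` (0-based), and accessions
    occur in one or more of the remaining columns.
    """
    mapping = {}
    with bz2.open(path, "rt") as handle:
        for line in handle:
            # Skip empty lines and comment/header lines.
            if not line.strip() or line.startswith("#"):
                continue
            fields = line.rstrip("\n").split(sep)
            if tax_col >= len(fields):
                continue
            taxonomy = fields[tax_col]
            # Collect every accession found outside the taxonomy column.
            for index, field in enumerate(fields):
                if index == tax_col:
                    continue
                for accession in ACCESSION.findall(field):
                    mapping[accession] = taxonomy
    return mapping


# Example call (file name taken from the post above):
# mapping = load_accession_to_taxonomy("taxa2genomes_cpa201901_up201901.txt.bz2")
```

Keys are stored without the version suffix so that they can match leaf names like GCA_002407065.
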
If you then want to rename the leaves of the tree, you can edit their labels with Python (using Phylo from Biopython to manage phylogenies).
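
A minimal sketch of the Biopython route, assuming the tree is in Newick format and that you already have a dict from genome name to the label you want. The function name, the suffix list, and the file names in the comments are placeholders for illustration, not part of PhyloPhlAn.

```python
from Bio import Phylo


def relabel_leaves(tree, mapping, suffixes=(".fna.gz", ".fna", ".faa")):
    """Rename leaves of a Bio.Phylo tree in place; return how many were renamed.

    `mapping` maps genome names (e.g. "GCA_002407065") to new labels.
    `suffixes` are file extensions stripped from a leaf name before the
    lookup (an assumption, in case leaves carry the input file extension).
    """
    renamed = 0
    for leaf in tree.get_terminals():
        key = leaf.name or ""
        for suffix in suffixes:
            if key.endswith(suffix):
                key = key[: -len(suffix)]
                break
        # Leaves without an entry in the mapping keep their original name.
        if key in mapping:
            leaf.name = mapping[key]
            renamed += 1
    return renamed


# Example usage (placeholder file names):
# tree = Phylo.read("input_tree.tre", "newick")
# count = relabel_leaves(tree, {"GCA_002407065": "s__Some_species"})
# Phylo.write(tree, "relabeled_tree.tre", "newick")
```

It is worth comparing the returned count with the number of leaves, since any leaf missing from the mapping is left unchanged.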

Alternatively, I have a draft script I can point you to, if you want, that edits node labels using an input mapping file.

I hope this helps, thanks,
Francesco

Awesome, thank you for the clarification and direction! And I’d love to learn more from the draft script if you are willing to kindly share it, thanks again!

Edit: sorry, I think I should try the above methods first before taking your script. I will let you know if I encounter any issues! Thank you!

Yeah, sure, no problem.

It is just that the repo is a collection of random scripts (some of them quite old), and I had kept it private because it is not very well organized.

I just now made it public and you can find it here: GitHub - fasnicar/utils.
The script you want to use is tree_replace_node_labels.py. There is not much documentation, but I hope the command line help is clear enough for you to use it.

Thanks,
Francesco
