Phylogenetic tree for 2000 genomes

Jesse · June 10, 2024, 5:20pm

Hello,
I want to build a phylogenetic tree for ~2000 genomes with high diversity.
I followed the tutorial, but I am not sure what the best way to do that is with PhyloPhlAn 3.1.22.
Based on the different examples, the tree of life (TOL) one seems the most relevant, so instead of downloading the TOL genomes I put all of my 2000 genomes in the input dir.
I have the following questions:

1. Is that the right way to do it?
2. What are the steps to build the tree? I can see that the config file sets amino acids as the database. Why, given that the input is genomes?
3. The run takes a very long time (mapping ~100 genomes took 2 days). Any idea how to speed things up?
4. If I want to build it based on nucleic acids, how should I update the config?

This is how the config was built:
phylophlan_write_config_file -d a -o tol_similar.cfg --db_aa diamond --map_dna diamond --map_aa diamond --msa mafft --trim trimal --tree1 iqtree --verbose

Thanks!

f.asnicar · July 11, 2024, 3:49pm

Hello!

So, assuming you already have your 2k genomes.

The “high diversity” preset is intended for cases where you have many genomes (2k, for example) representing different species. Is this the case? Or are your 2k genomes coming from the same family/genus/species?

The config file sets the database to amino acids because if you use the 400 universal markers from PhyloPhlAn (the -d phylophlan param), those are proteins, so the database is amino acids and the config will set up the translated search if you provide genomes as input.

The mapping step can be very slow because of the translated search. You can run prokka on your genomes and provide the proteomes (the .faa files) as input instead of genomes. At that point, the mapping will be AAs <> AAs, and it will be much faster (and if you’re in the high diversity case, you won’t lose much resolution).
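
The Prokka route described above could be sketched roughly as follows. This is a minimal sketch under stated assumptions, not an official PhyloPhlAn recipe: the folder names (genomes, prokka_out, proteomes), the .fna extension, the thread count, and the helper name prokka_cmds are all made up for illustration, and the config file name is the one from the first post. The function only prints the Prokka commands so they can be checked before anything runs.

```shell
#!/bin/sh
# Print one prokka command per genome (dry run; nothing is executed here).
# Assumed layout: $1 = folder with *.fna genomes, $2 = folder for Prokka output,
# $3 = CPUs per Prokka job (defaults to 4). Paths are assumed to have no spaces.
prokka_cmds() {
    in_dir=$1; out_dir=$2; cpus=${3:-4}
    for fna in "$in_dir"/*.fna; do
        [ -e "$fna" ] || continue            # skip if the glob matched nothing
        name=$(basename "$fna" .fna)
        printf 'prokka --outdir %s/%s --prefix %s --cpus %s %s\n' \
            "$out_dir" "$name" "$name" "$cpus" "$fna"
    done
}

# Possible use (commented out so the file can be sourced safely):
# prokka_cmds genomes prokka_out 8 | sh
# mkdir -p proteomes && cp prokka_out/*/*.faa proteomes/
# --fast is one of the two presets (the other is --accurate); pick what suits you.
# phylophlan -i proteomes -d phylophlan --diversity high --fast -f tol_similar.cfg --nproc 8
```

Printing the commands first makes it easy to hand them to a job scheduler or to a tool like GNU parallel instead of running 2k annotations one after another.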

If, as you mentioned, you want to do a phylogeny on the nucleotide space, then you need a database of genes used for the mapping. The databases provided within PhyloPhlAn are proteins, so you need to identify the nucleotide sequences of single-copy and conserved genes and then use those as a database. However, in your case, with 2k genomes (assuming the high diversity), I don’t think it will be advisable to do so. It is different if your 2k genomes are all from the same species. In this case, you might want to identify their core genes (for instance, from Roary) and use them as a database to build your species-level phylogeny.
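
For the same-species case, a nucleotide configuration might look like the sketch below. It mirrors the command from the first post with the database type switched to nucleotides. The makeblastdb/blastn choices, the helper name nt_config_cmd, and the output name core_genes_nt.cfg are assumptions, so check phylophlan_write_config_file -h for the options your version supports. The function only prints the command.

```shell
#!/bin/sh
# Print a config-writing command for a nucleotide marker database (dry run).
# $1 = name of the config file to create (hypothetical default below).
nt_config_cmd() {
    cfg=${1:-core_genes_nt.cfg}
    printf '%s\n' "phylophlan_write_config_file -d n -o $cfg --db_dna makeblastdb --map_dna blastn --msa mafft --trim trimal --tree1 iqtree --verbose"
}

# Possible use (commented out so the file can be sourced safely):
# nt_config_cmd my_species.cfg | sh
```

This covers only the config file. The core genes themselves would still have to be set up as a custom PhyloPhlAn database, which is not shown here.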

I hope this helps, thank you,
Francesco

Shaelyn · July 11, 2024, 10:51pm

Hello Francesco,

Thank you for the clarification. I’m currently working on a project similar to Jesse’s, where I have analyzed 2000+ genomes from various species using nucleotide sequences as input data. Upon running the PhyloPhlAn code, I encountered warnings and errors indicating “Not enough markers” before building the tree. Could this issue be related to the challenges you mentioned of using nucleotide genomes for phylogenetic analysis across diverse taxa?

Thank you!
Shaelyn
