Hello,
I want to build a phylogenetic tree for ~2000 genomes with high diversity.
I followed the tutorial, but I am not sure what the best way to do that is with PhyloPhlAn 3.1.22.
Based on the different examples, the tree of life (TOL) one seems the most relevant, so instead of downloading the TOL genomes I put all 2000 of my genomes in the input dir.
I have the following questions:
1. Is that the right way to do it?
2. What are the steps to build the tree? I can see that the config file sets amino acids as the database. Why, given that the input is genomes?
3. The run takes very long (mapping ~100 genomes took 2 days). Any idea how to speed things up?
4. If I want to build it based on nucleic acids, how should I update the config?
This is how the config was built:
phylophlan_write_config_file -d a -o tol_similar.cfg --db_aa diamond --map_dna diamond --map_aa diamond --msa mafft --trim trimal --tree1 iqtree --verbose
The “high diversity” preset is intended for cases where you have many genomes (2k, for example) representing different species. Is this the case? Or do your 2k genomes come from the same family/genus/species?
The config file sets the database to amino acids because if you use the 400 universal markers from PhyloPhlAn (the -d phylophlan param), those are proteins, so the database is amino acids and the config will set up the translated search if you provide genomes as input.
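As a rough sketch of what the tree-building run could look like with that config, assuming the genomes sit in a folder called `input_genomes` (the folder names, the output name, and the `--nproc` value are placeholders, not taken from this thread). The script only prints the command by default; set `DRY_RUN=0` to execute it.

```shell
#!/usr/bin/env bash
# Sketch of a PhyloPhlAn run for the high-diversity case.
# Assumptions (hypothetical): INPUT_DIR, OUTPUT and NPROC values.
set -euo pipefail

INPUT_DIR="${INPUT_DIR:-input_genomes}"   # folder with the input genomes
CONFIG="${CONFIG:-tol_similar.cfg}"       # config written by phylophlan_write_config_file
OUTPUT="${OUTPUT:-output_tol}"            # hypothetical output name
NPROC="${NPROC:-8}"                       # number of CPUs to use
DRY_RUN="${DRY_RUN:-1}"                   # 1 = print the command, 0 = run it

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Map against the 400 universal markers (-d phylophlan) with the
# high-diversity preset and the user-supplied config file.
build_tree() {
    run phylophlan -i "$INPUT_DIR" -d phylophlan -f "$CONFIG" \
        --diversity high --nproc "$NPROC" -o "$OUTPUT" --verbose
}

build_tree
```

Raising `NPROC` is the simplest lever for run time here, since the translated search dominates when the input is genomes.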
The mapping step can be very slow because of the translated search. You can run Prokka on your genomes and provide the proteomes (the .faa files) as input instead of the genomes. The mapping will then be amino acids against amino acids, which is much faster (and if you’re in the high-diversity case, you won’t lose much resolution).
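A minimal sketch of that Prokka step, assuming the genomes are `*.fna` files in a folder called `genomes` (all folder names and the CPU count are placeholders). It annotates each genome and copies the resulting `.faa` into a single folder that can then be passed to PhyloPhlAn as input. By default it only prints the commands; set `DRY_RUN=0` to run them.

```shell
#!/usr/bin/env bash
# Sketch: annotate each genome with Prokka and collect the proteomes.
# Assumptions (hypothetical): folder names, *.fna extension, CPU count.
set -euo pipefail

GENOMES_DIR="${GENOMES_DIR:-genomes}"        # input genomes (*.fna)
PROKKA_DIR="${PROKKA_DIR:-prokka_out}"       # one Prokka subfolder per genome
PROTEOMES_DIR="${PROTEOMES_DIR:-proteomes}"  # collected .faa files
CPUS="${CPUS:-4}"
DRY_RUN="${DRY_RUN:-1}"                      # 1 = print commands, 0 = run them

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Annotate one genome and copy its predicted proteins to PROTEOMES_DIR.
annotate_genome() {
    local genome="$1"
    local name
    name="$(basename "${genome%.*}")"        # genomes/sample1.fna -> sample1
    run prokka --outdir "${PROKKA_DIR}/${name}" --prefix "${name}" \
        --cpus "${CPUS}" "${genome}"
    run cp "${PROKKA_DIR}/${name}/${name}.faa" "${PROTEOMES_DIR}/${name}.faa"
}

run mkdir -p "${PROTEOMES_DIR}"
for genome in "${GENOMES_DIR}"/*.fna; do
    [ -e "$genome" ] || continue             # skip if the glob matched nothing
    annotate_genome "$genome"
done
```

With ~2000 genomes the annotations are independent of each other, so the loop can be split across cluster jobs if one is available.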
If, as you mentioned, you want to do a phylogeny on the nucleotide space, then you need a database of genes used for the mapping. The databases provided within PhyloPhlAn are proteins, so you need to identify the nucleotide sequences of single-copy and conserved genes and then use those as a database. However, in your case, with 2k genomes (assuming the high diversity), I don’t think it will be advisable to do so. It is different if your 2k genomes are all from the same species. In this case, you might want to identify their core genes (for instance, from Roary) and use them as a database to build your species-level phylogeny.
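For the nucleotide case, a config could be sketched along these lines. This mirrors the command from the first post with the database type switched to nucleotides; the tool choices (`makeblastdb`, `blastn`) and the output file name are assumptions, so please check them against `phylophlan_write_config_file -h` for your version. It still requires a nucleotide marker database, as described above. The script prints the command by default; set `DRY_RUN=0` to run it.

```shell
#!/usr/bin/env bash
# Sketch: write a PhyloPhlAn config for a nucleotide marker database.
# Assumptions (hypothetical): output file name and the chosen tools.
set -euo pipefail

CONFIG_OUT="${CONFIG_OUT:-custom_nt.cfg}"  # hypothetical config file name
DRY_RUN="${DRY_RUN:-1}"                    # 1 = print the command, 0 = run it

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# -d n selects a nucleotide database; mapping is then DNA against DNA.
write_nt_config() {
    run phylophlan_write_config_file -d n -o "$CONFIG_OUT" \
        --db_dna makeblastdb --map_dna blastn \
        --msa mafft --trim trimal --tree1 iqtree --verbose
}

write_nt_config
```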
Thank you for the clarification. I’m currently working on a project similar to Jesse’s, where I have analyzed 2000+ genomes from various species using nucleotide sequences as input data. Upon running the PhyloPhlAn code, I encountered warnings and errors indicating “Not enough markers” before building the tree. Could this issue be related to the challenges you mentioned of using nucleotide genomes for phylogenetic analysis across diverse taxa?