Hello Mick,
So, I guess it should take less than one week for 2.6k genomes. One thing you can try is to skip the conversion to amino acids using the --force_nucleotides option, assuming your inputs (phylophlan_input) are all genomes; if you have a mix of genomes and proteomes, then you can't. If you do this, you also have to create a new config file specifying the same option.
Another thing you can do is to avoid the two-step phylogeny reconstruction that is the default in the configuration file. The standard config file has a [tree1] section specifying FastTree and a [tree2] section specifying RAxML. You can create a new config file with only a [tree1] section that specifies RAxML.
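As a rough sketch of the two config suggestions above, something like the following could generate a RAxML-only config. The output file name is hypothetical, and I'm assuming the tool choices (diamond, mafft, trimal) match what your current config uses, so please check them against your supermatrix_aa.cfg before running anything. The snippet only prints the command so you can review it first.

```shell
# Sketch only: supermatrix_raxml_only.cfg is a hypothetical file name, and
# diamond/mafft/trimal are assumed to match your existing configuration.
CFG_CMD="phylophlan_write_config_file \
  -o supermatrix_raxml_only.cfg \
  -d a \
  --db_aa diamond \
  --map_dna diamond \
  --map_aa diamond \
  --msa mafft \
  --trim trimal \
  --tree1 raxml"

# Only if every input is a genome (no proteomes), also add:
# CFG_CMD="$CFG_CMD --force_nucleotides"

# Print the command for review; once it looks right, run it with: eval "$CFG_CMD"
echo "$CFG_CMD"
```

Leaving out --tree2 is what makes this a single-step reconstruction. If you enable --force_nucleotides here, remember to pass it to phylophlan as well.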
From the --diversity low --fast combination (PhyloPhlAn wiki - Accurate or Fast), you can see that the --subsample parameter is set to fivehundred. You can override this default on the command line with any of the options detailed in PhyloPhlAn wiki - Subsampling. For instance, threehundred or twentyfivepercent can reduce the length of the MSA and speed up the phylogeny reconstruction step, or you can even use phylophlan, since you're using -d phylophlan.
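To make the override concrete, here is a sketch of what the call could look like. The input folder, database, and config names are taken from your post; the --nproc value of 16 is my assumption about what you are currently using, and any other options you pass (output folder and so on) are left out. Again, it only prints the command for review.

```shell
# Sketch only: --nproc 16 is an assumed value; other options you already
# pass (for example the output folder) are omitted here.
RUN_CMD="phylophlan \
  -i phylophlan_input \
  -d phylophlan \
  -f supermatrix_aa.cfg \
  --diversity low \
  --fast \
  --subsample twentyfivepercent \
  --nproc 16"

# Print the command for review; once it looks right, run it with: eval "$RUN_CMD"
echo "$RUN_CMD"
```

Swap twentyfivepercent for threehundred or phylophlan to try the other subsampling options mentioned above.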
The other obvious thing would be to increase the number of CPUs, but that depends on the hardware available. Also, PhyloPhlAn caps the number of CPUs at 20 when running RAxML, so going from 16 to 20 probably won't change much in your case.
Please let me know if these are of any help.
Many thanks,
Francesco
Notes:
I just wanted to point out that you don’t need to specify this:
--maas phylophlan/phylophlan_substitution_models/phylophlan.tsv
This is only needed when you run a gene tree pipeline, but from the name of your configuration file (supermatrix_aa.cfg), I'm guessing you're running a concatenation pipeline.
And also these:
--remove_fragmentary_entries --fragmentary_threshold 0.85
These are already set by the --diversity low --fast combination (PhyloPhlAn wiki - Accurate or Fast).