The bioBakery help forum

Speeding up PhyloPhlAn (RAxML-HPC step)

Hi

We often need to create very large phylogenetic trees of genomes and the RAxML-HPC step of PhyloPhlAn often takes many days, sometimes stretching into weeks.

Is this normal and to be expected? Are there things we can do to speed it up?

The current job has been running for ~6 days on 2696 user genomes, with these options:

phylophlan -i phylophlan_input -o phylophlan_output -d phylophlan -t a -f phylophlan_configs/supermatrix_aa.cfg --nproc 16 --diversity low --fast --verbose --maas phylophlan/phylophlan_substitution_models/phylophlan.tsv --remove_fragmentary_entries --fragmentary_threshold 0.85

The RAxML job that PhyloPhlAn is running is this:

raxmlHPC-PTHREADS-SSE3 -p 1989 -m PROTCATLG -T 16 -t phylophlan_output/phylophlan_input_resolved.tre -w phylophlan_output -s phylophlan_output/phylophlan_input_concatenated.aln -n phylophlan_input_refined.tre

Any help appreciated :)

Thanks
Mick

Hello Mick,

I’d expect this to take less than one week for around 2,700 genomes. One thing you can try is to skip the conversion to amino acids with the --force_nucleotides option. This only works if your inputs (phylophlan_input) are all genomes; if you have a mix of genomes and proteomes, you can’t use it. If you go this way, you also have to create a new config file that specifies the same option.
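As a rough sketch of the nucleotide route, the two commands could look like the ones below. The config file name is a placeholder, and the tool choices (diamond, mafft, trimal, fasttree, raxml) are an assumption about what a supermatrix setup typically uses, so please check them against phylophlan_write_config_file --help for your installed version. The script only prints the commands so they can be inspected first.

```shell
# Dry run: both commands are printed, not executed.
# Assumptions: the config file name is a placeholder, and the tool choices
# below mirror a typical supermatrix config rather than your exact one.
nt_cfg="phylophlan_configs/supermatrix_nt.cfg"

# Step 1: write a config file with the nucleotide option switched on.
write_cfg="phylophlan_write_config_file -o $nt_cfg -d a --force_nucleotides \
--db_aa diamond --map_dna diamond --map_aa diamond \
--msa mafft --trim trimal --tree1 fasttree --tree2 raxml"

# Step 2: rerun PhyloPhlAn with the new config and the same option.
run="phylophlan -i phylophlan_input -o phylophlan_output -d phylophlan -t a \
-f $nt_cfg --force_nucleotides --nproc 16 --diversity low --fast --verbose"

echo "$write_cfg"
echo "$run"
```

Once the printed commands look right, they can be pasted into the terminal or a job script as they are.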

Another thing you can do is to avoid the two-step phylogeny reconstruction that the default configuration file sets up. The standard config file has a [tree1] section specifying FastTree and a [tree2] section specifying RAxML. You can create a new config file with only a [tree1] section that specifies RAxML.
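A single-step config could be generated roughly like this. Again, the output file name is hypothetical and the mapping/alignment/trimming tools are assumed, not taken from your current config; the only point here is --tree1 raxml with no --tree2. The command is printed rather than run.

```shell
# Dry run: the command is printed, not executed.
# Assumptions: placeholder config name; diamond/mafft/trimal are guesses
# at the tools your current config uses.
one_step_cfg="phylophlan_configs/supermatrix_aa_raxml_only.cfg"

# Only --tree1 is given, so the config gets a single [tree1] section (RAxML).
write_cfg="phylophlan_write_config_file -o $one_step_cfg -d a \
--db_aa diamond --map_dna diamond --map_aa diamond \
--msa mafft --trim trimal --tree1 raxml"

echo "$write_cfg"
```

The new file would then be passed to phylophlan with -f in place of supermatrix_aa.cfg.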

With the --diversity low --fast combination (PhyloPhlAn wiki - Accurate or Fast), the --subsample parameter is set to fivehundred. You can override this default on the command line with any of the options detailed in PhyloPhlAn wiki - Subsampling. For instance, threehundred or twentyfivepercent can reduce the length of the MSA and speed up the phylogeny reconstruction step, or you can even use phylophlan, since you’re using -d phylophlan.
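For example, overriding the subsampling on the command line might look like the sketch below. The value picked is just one of those mentioned above, and the rest of the command is copied from your original call; the script prints the command instead of running it.

```shell
# Dry run: the command is printed, not executed.
# Pick one of the values mentioned above: threehundred, twentyfivepercent, phylophlan.
subsample="twentyfivepercent"

# Same call as before, with --subsample overriding the fivehundred default.
run="phylophlan -i phylophlan_input -o phylophlan_output -d phylophlan -t a \
-f phylophlan_configs/supermatrix_aa.cfg --nproc 16 --diversity low --fast \
--subsample $subsample --verbose"

echo "$run"
```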

The other obvious thing would be to increase the number of CPUs, but that depends on the hardware available. Also, PhyloPhlAn caps the number of threads at 20 when running RAxML, so I don’t think going from 16 to 20 will change much in your case.
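If it helps when sizing a job with the RAxML step in mind, a small sketch like this picks a --nproc value that never exceeds the cap of 20 mentioned above. The variable names are made up for illustration, and other PhyloPhlAn steps may still benefit from more cores.

```shell
# The cap of 20 comes from the reply above; variable names are illustrative.
max_raxml_threads=20

# Number of CPUs currently online on this machine.
available=$(getconf _NPROCESSORS_ONLN)

# Use whichever is smaller: the available CPUs or the RAxML cap.
if [ "$available" -gt "$max_raxml_threads" ]; then
    nproc_arg=$max_raxml_threads
else
    nproc_arg=$available
fi

echo "--nproc $nproc_arg"
```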

Please let me know if these are of any help.

Many thanks,
Francesco


Notes:
I just wanted to point out that you don’t need to specify this:

--maas phylophlan/phylophlan_substitution_models/phylophlan.tsv

This is needed when you run a gene tree pipeline, but from the name of your configuration file (supermatrix_aa.cfg), I’m guessing you’re running a concatenation pipeline.

And also these:

--remove_fragmentary_entries --fragmentary_threshold 0.85

They are already set by the --diversity low --fast combination (PhyloPhlAn wiki - Accurate or Fast).
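Putting these notes together, the original call with the unneeded options removed would be along these lines. Nothing new is added here, only --maas, --remove_fragmentary_entries and --fragmentary_threshold are dropped; the script prints the command instead of running it.

```shell
# Dry run: the command is printed, not executed.
# Original call minus --maas, --remove_fragmentary_entries and --fragmentary_threshold.
run="phylophlan -i phylophlan_input -o phylophlan_output -d phylophlan -t a \
-f phylophlan_configs/supermatrix_aa.cfg --nproc 16 --diversity low --fast --verbose"

echo "$run"
```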