Speeding up PhyloPhlAn (RAxML-HPC step)

Hi

We often need to create very large phylogenetic trees of genomes and the RAxML-HPC step of PhyloPhlAn often takes many days, sometimes stretching into weeks.

Is this normal and to be expected? Are there things we can do to speed it up?

The current job has been running for ~6 days on 2696 user genomes, with these options:

phylophlan -i phylophlan_input -o phylophlan_output -d phylophlan -t a -f phylophlan_configs/supermatrix_aa.cfg --nproc 16 --diversity low --fast --verbose --maas phylophlan/phylophlan_substitution_models/phylophlan.tsv --remove_fragmentary_entries --fragmentary_threshold 0.85

The RAxML job that PhyloPhlAn is running is this:

raxmlHPC-PTHREADS-SSE3 -p 1989 -m PROTCATLG -T 16 -t phylophlan_output/phylophlan_input_resolved.tre -w phylophlan_output -s phylophlan_output/phylophlan_input_concatenated.aln -n phylophlan_input_refined.tre

Any help appreciated :)

Thanks
Mick

Hello Mick,

I would expect it to take less than one week for 2.6k genomes. One thing you can try is skipping the conversion to amino acids with the --force_nucleotides option. This assumes your inputs (phylophlan_input) are all genomes; if you have a mix of genomes and proteomes, you can’t use it. If you do this, you also have to create a new config file that specifies the same option.
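A rough sketch of generating such a config, assuming phylophlan_write_config_file is installed alongside phylophlan. The output name supermatrix_nt.cfg and the tool choices (diamond, mafft, trimal, fasttree, raxml) are assumptions for illustration, not something stated in this thread. The snippet only prints the command so it can be reviewed before running:

```shell
# Hypothetical output name; any writable path works.
CFG=supermatrix_nt.cfg

# Build the config-generation command piece by piece.
# Tool choices below are assumptions; swap in the ones you actually use.
CMD="phylophlan_write_config_file -o $CFG -d a --force_nucleotides"
CMD="$CMD --db_aa diamond --map_dna diamond --msa mafft --trim trimal"
CMD="$CMD --tree1 fasttree --tree2 raxml"

# Dry run: print the command instead of executing it.
echo "$CMD"
```

The phylophlan call itself would then also need --force_nucleotides, with -f pointing at the new config file.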

Another thing you can do is avoid the two-step phylogeny reconstruction that the default configuration file sets up. The standard config file has a [tree1] section specifying FastTree and a [tree2] section specifying RAxML. You can create a new config file with only a [tree1] section that specifies RAxML.
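One way this might look, assuming your installed phylophlan_write_config_file accepts raxml as a value for --tree1 (check its --help first). The file name and the mapping/alignment tool choices are hypothetical. Leaving out --tree2 is what is meant to produce a config with a single tree-building step:

```shell
# Hypothetical output name for the single-step config.
CFG=supermatrix_aa_raxml_only.cfg

# Same amino-acid pipeline, but RAxML is the only tree step.
# diamond/mafft/trimal are assumed tool choices.
CMD="phylophlan_write_config_file -o $CFG -d a"
CMD="$CMD --db_aa diamond --map_dna diamond --msa mafft --trim trimal"
CMD="$CMD --tree1 raxml"

# Dry run: print the command instead of executing it.
echo "$CMD"
```

It is worth opening the generated file afterwards to confirm it contains only a [tree1] section before passing it to phylophlan with -f.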

From the --diversity low --fast combination (PhyloPhlAn wiki - Accurate or Fast), you can see that the --subsample parameter is set to fivehundred. You can override this default on the command line with any of the options detailed in PhyloPhlAn wiki - Subsampling. For instance, threehundred or twentyfivepercent can reduce the length of the MSA and speed up the phylogeny reconstruction step, or you can even use phylophlan, since you’re using -d phylophlan.
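For illustration, here is the original command with an explicit --subsample override added. The value twentyfivepercent is just one of the options named above, chosen as an example; everything else is copied from the command in the question. The snippet prints the command rather than running it:

```shell
# Example override; threehundred or phylophlan would go here instead.
SUBSAMPLE=twentyfivepercent

# Paths and remaining options are taken from the command posted above.
CMD="phylophlan -i phylophlan_input -o phylophlan_output -d phylophlan -t a"
CMD="$CMD -f phylophlan_configs/supermatrix_aa.cfg --nproc 16"
CMD="$CMD --diversity low --fast --subsample $SUBSAMPLE --verbose"

# Dry run: print the command instead of executing it.
echo "$CMD"
```

A shorter MSA trades some phylogenetic signal for speed, so it may be worth comparing the resulting tree against one built with the default on a smaller set of genomes.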

The other obvious thing would be to increase the number of CPUs, but that depends on the hardware available. Also, PhyloPhlAn caps RAxML at 20 threads, so going from 16 to 20 probably won’t change much in your case.

Please let me know if these are of any help.

Many thanks,
Francesco


Notes:
I just wanted to point out that you don’t need to specify this:

--maas phylophlan/phylophlan_substitution_models/phylophlan.tsv

This is only needed when you run a gene tree pipeline, and from the name of your configuration file (supermatrix_aa.cfg), I’m guessing you’re running a concatenation pipeline.

And also these:

--remove_fragmentary_entries --fragmentary_threshold 0.85

They are already set by the --diversity low --fast combination (PhyloPhlAn wiki - Accurate or Fast).
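Putting these notes together, the original command without the redundant options would look roughly like the sketch below. Nothing is added here; the only change is that --maas, --remove_fragmentary_entries and --fragmentary_threshold are dropped. The snippet prints the command for review instead of running it:

```shell
# Original command from the question, minus the options noted as unnecessary.
CMD="phylophlan -i phylophlan_input -o phylophlan_output -d phylophlan -t a"
CMD="$CMD -f phylophlan_configs/supermatrix_aa.cfg --nproc 16"
CMD="$CMD --diversity low --fast --verbose"

# Dry run: print the command instead of executing it.
echo "$CMD"
```

This only tidies the command line; it is not expected to change the run time by itself.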

Hi, we’re using phylophlan/3.0.3 and experiencing some trouble with execution time. We are analyzing 2.5k bacterial genomes (all Bacillus) and our command is:

phylophlan \
    --force_nucleotides \
    -i seqs/ \
    -d phylophlan \
    -f supermatrix_aa.cfg \
    --diversity low \
    --fast \
    -o analysis \
    --nproc 128 \
    --verbose 2>&1 | tee logs/phylophlan.log

It’s been running for 6 days now and the time limit in the cluster we’re using is 7 days. Would it be possible to use raxmlHPC-MPI instead of raxmlHPC-PTHREADS? Or if you have any other suggestions on how to speed it up, it would be greatly appreciated.

Many thanks in advance
Luis

Dear Luis,

I think you can plug in raxmlHPC-MPI; you’ll just need to manually edit the config file to specify the executable and the right parameters.
Alternatively, considering the large number of genomes in your tree, you can try using IQ-TREE instead of RAxML, as it seems to be faster for large trees.
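A possible starting point for the IQ-TREE route, assuming your phylophlan_write_config_file accepts iqtree as a value for --tree1 (worth confirming with --help on 3.0.3). The output name and the diamond/mafft/trimal choices are hypothetical, and --force_nucleotides is kept to match the command above. The snippet prints the command rather than running it:

```shell
# Hypothetical output name for an IQ-TREE based config.
CFG=supermatrix_iqtree.cfg

# Single tree step using IQ-TREE instead of FastTree + RAxML.
# diamond/mafft/trimal are assumed tool choices.
CMD="phylophlan_write_config_file -o $CFG -d a --force_nucleotides"
CMD="$CMD --db_aa diamond --map_dna diamond --msa mafft --trim trimal"
CMD="$CMD --tree1 iqtree"

# Dry run: print the command instead of executing it.
echo "$CMD"
```

The generated file can then be passed to phylophlan with -f in place of supermatrix_aa.cfg.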

Please let me know if you managed to get your phylogeny!

Thanks,
Francesco