Optimize run time in Phylophlan3

Hi,

Thanks for the great program, PhyloPhlAn.
In a recent analysis, I ran it with the low-diversity, accurate settings to infer the phylogeny of 29 MAGs. I noticed that the step that generates the b6o.bkp files is very slow: many of the files take 10 to 20 hours each.

"G4-bin5.b6o.bkp" generated in 636s
Mapping "/home/gnii0001/pc77/gaofeng/phylogenetics/result/mycobacterium_phylophlan/tmp/clean_dna/GCA_000195955.2_ASM19595v2_genomic.fna"
"G4-bin2.b6o.bkp" generated in 821s
Mapping "/home/gnii0001/pc77/gaofeng/phylogenetics/result/mycobacterium_phylophlan/tmp/clean_dna/G4-bin1.fna"
"G5-bin6.b6o.bkp" generated in 1153s
Mapping "/home/gnii0001/pc77/gaofeng/phylogenetics/result/mycobacterium_phylophlan/tmp/clean_dna/G5-bin3.fna"
"G5-bin4.b6o.bkp" generated in 1275s
Mapping "/home/gnii0001/pc77/gaofeng/phylogenetics/result/mycobacterium_phylophlan/tmp/clean_dna/GCA_022430545.1_ASM2243054v1_genomic.fna"
"G4-bin4.b6o.bkp" generated in 1593s
Mapping "/home/gnii0001/pc77/gaofeng/phylogenetics/result/mycobacterium_phylophlan/tmp/clean_dna/GCA_002086455.1_ASM208645v1_genomic.fna"
"ur-bin14.b6o.bkp" generated in 1601s
Mapping "/home/gnii0001/pc77/gaofeng/phylogenetics/result/mycobacterium_phylophlan/tmp/clean_dna/G5-bin5.fna"
"G5-bin3.b6o.bkp" generated in 1000s
Mapping "/home/gnii0001/pc77/gaofeng/phylogenetics/result/mycobacterium_phylophlan/tmp/clean_dna/GCA_001457455.1_NCTC11397_genomic.fna"
"G4-bin1.b6o.bkp" generated in 7059s
Mapping "/home/gnii0001/pc77/gaofeng/phylogenetics/result/mycobacterium_phylophlan/tmp/clean_dna/G4-bin3.fna"
"G5-bin2.b6o.bkp" generated in 8951s
Mapping "/home/gnii0001/pc77/gaofeng/phylogenetics/result/mycobacterium_phylophlan/tmp/clean_dna/GCA_002086515.1_ASM208651v1_genomic.fna"
"GCA_002101775.1_ASM210177v1_genomic.b6o.bkp" generated in 15077s
"GCA_002086515.1_ASM208651v1_genomic.b6o.bkp" generated in 7232s
"G5-bin5.b6o.bkp" generated in 19368s
"GCA_002086285.1_ASM208628v1_genomic.b6o.bkp" generated in 22970s
"GCA_018455725.1_ASM1845572v1_genomic.b6o.bkp" generated in 23438s
"G4-bin3.b6o.bkp" generated in 19439s
"GCA_002101585.1_ASM210158v1_genomic.b6o.bkp" generated in 29428s
"GCA_002102395.1_ASM210239v1_genomic.b6o.bkp" generated in 43347s
"GCA_002102265.1_ASM210226v1_genomic.b6o.bkp" generated in 43984s
"GCA_002101655.1_ASM210165v1_genomic.b6o.bkp" generated in 48623s
"GCA_002086405.1_ASM208640v1_genomic.b6o.bkp" generated in 50661s
"GCA_002101955.1_ASM210195v1_genomic.b6o.bkp" generated in 53842s
"GCA_002086455.1_ASM208645v1_genomic.b6o.bkp" generated in 54478s
"GCA_001021505.1_ASM102150v1_genomic.b6o.bkp" generated in 70139s
"GCA_002101885.1_ASM210188v1_genomic.b6o.bkp" generated in 73144s
"GCA_000295855.1_ASM29585v1_genomic.b6o.bkp" generated in 73656s

In comparison, in a previous analysis I inferred a phylum-level tree under the high-diversity setting. There, each of these steps took under 10 minutes, and the entire PhyloPhlAn run completed in under 40 hours, if I’m not mistaken.

...
"ant.bin.451.b6o.bkp" generated in 458s
"ant.bin.107.b6o.bkp" generated in 498s
"swi.bin.233.b6o.bkp" generated in 414s
"ant.bin.464.b6o.bkp" generated in 456s
"swi.bin.25.b6o.bkp" generated in 392s
"ant.bin.293.b6o.bkp" generated in 424s
"ant.bin.129.b6o.bkp" generated in 523s
...
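
To compare the two runs more easily, the per-file durations can be pulled out of the logs and converted to hours. This is only a minimal sketch: it assumes the log lines have exactly the `"<name>.b6o.bkp" generated in <N>s` form shown in the excerpts above, and the function name `summarize_b6o_times` and the example path are illustrative, not part of PhyloPhlAn.

```shell
# Summarize per-file mapping times from a PhyloPhlAn log.
# Assumes lines of the form: "NAME.b6o.bkp" generated in 636s
summarize_b6o_times() {
    log_file="$1"
    # Keep only the timing lines; "Mapping ..." lines are ignored.
    grep -E '\.b6o\.bkp" generated in [0-9]+s' "$log_file" |
    awk '{
        secs = $NF
        sub(/s$/, "", secs)          # strip the trailing "s"
        secs += 0                    # force numeric
        total += secs
        n++
        if (secs > max) { max = secs; slowest = $1 }
    }
    END {
        if (n == 0) { print "no b6o.bkp timings found"; exit }
        printf "files: %d\n", n
        printf "mean: %.1f h\n", total / n / 3600
        printf "slowest: %s (%.1f h)\n", slowest, max / 3600
    }'
}

# Example (hypothetical path):
# summarize_b6o_times /path/to/mycobacterium.out
```

Running it on the log of each analysis gives a quick side-by-side view of the mean and worst-case mapping time per genome.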

This looks similar to the thread "Phylophlan is running too slow when mapping DNA".

I can confirm that I used DIAMOND version 2.0.15 in all of the analyses mentioned above.
Is reverting to DIAMOND 0.9.24 an option? Or should I use a different aligner?
Would using --fast mode solve this issue at its root?
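
Before deciding whether to downgrade, it may be worth confirming which DIAMOND binary and version the activated environment actually picks up. A small sketch: `diamond_major_version` is a hypothetical helper, and it assumes the version string has the `diamond version X.Y.Z` form.

```shell
# Extract the major version from a string like "diamond version 2.0.15".
# Prints nothing if the string does not match that form.
diamond_major_version() {
    printf '%s\n' "$1" | sed -n -E 's/^diamond version ([0-9]+)\..*$/\1/p'
}

# Report the binary on PATH and its version, if DIAMOND is available.
if command -v diamond >/dev/null 2>&1; then
    version_string=$(diamond version)
    echo "binary:  $(command -v diamond)"
    echo "version: $version_string"
    echo "major:   $(diamond_major_version "$version_string")"
else
    echo "diamond not found on PATH (is the environment activated?)"
fi
```

Running this inside the SLURM job, right after activating the environment, shows the version the job really uses rather than the one on the login node.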

My command:

#!/bin/bash
#SBATCH --job-name=wgt_mycobacterium
#SBATCH --partition=m3g,comp,m3m,m3j,m3h
#SBATCH --ntasks=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=20
#SBATCH --time=7-00:00:00
#SBATCH --output=/home/gnii0001/pc77/gaofeng/phylogenetics/log/mycobacterium.out
#SBATCH --error=/home/gnii0001/pc77/gaofeng/phylogenetics/log/mycobacterium.err
#SBATCH --export=ALL

source /fs04/rp24/gaofeng/tools/mambaforge/etc/profile.d/mamba.sh
source /fs04/rp24/gaofeng/tools/mambaforge/etc/profile.d/conda.sh
mamba activate phylophlan

phylophlan \
--input_folder /home/gnii0001/02_acidic_nitritation/data/MAGs/mycobacterium \
--output_folder /home/gnii0001/pc77/gaofeng/phylogenetics/result \
--nproc 20 \
--diversity low \
-d phylophlan \
-f /home/gnii0001/rp24/gaofeng/tools/Miniconda3/envs/phylophlan/bin/configs/supermatrix_aa.cfg \
-i mycobacterium
mamba deactivate

My config:

[db_aa]
program_name = diamond
params = makedb
input = --in
output = --db
version = version
command_line = #program_name# #params# #input# #output#

[map_dna]
program_name = diamond
params = blastx --quiet --outfmt 6 --more-sensitive --id 50 --max-hsps 35 -k 0
input = --query
database = --db
output = --out
version = version
command_line = #program_name# #params# #input# #database# #output#

[map_aa]
program_name = diamond
params = blastp --quiet --outfmt 6 --more-sensitive --id 50 --max-hsps 35 -k 0
input = --query
database = --db
output = --out
version = version
command_line = #program_name# #params# #input# #database# #output#

[msa]
program_name = mafft
params = --quiet --anysymbol  --auto
version = --version
command_line = #program_name# #params# #input# > #output#
environment = TMPDIR=/tmp

[trim]
program_name = trimal
params = -gappyout
input = -in
output = -out
version = --version
command_line = #program_name# #params# #input# #output#

[tree1]
program_name = iqtree
params = -quiet -nt AUTO --alrt 1000 -B 1000 -m TEST
input = -s
output = -pre
version = -version
command_line = #program_name# #params# #input# #output#

Many thanks!
Gaofeng

Hi, just wondering if there is a solution for this?
Would downgrading DIAMOND help?
Thanks

Hi both, the problem here is that the translated mapping with newer DIAMOND versions (>2) is much slower. This is due to a bug fix: previous versions were not reporting all hits in the output.
I haven’t tested it myself in terms of running times, but one possibility is to switch to BLAST for the translated mapping. You can have a look at the phylophlan_write_config_file script to generate the proper config files.

Thanks,
Francesco
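
For reference, the suggestion above could look roughly like the sketch below. It mirrors the sections of the config posted earlier (db_aa, map_dna, map_aa, msa, trim, tree1) and swaps only the genome-mapping tool. Treat the details as assumptions: that `tblastn` is the right choice for `--map_dna`, that no extra database option is needed, and the output file name. `phylophlan_write_config_file --help` lists what the installed version accepts.

```shell
# Sketch only: generate a config that maps genomes with tblastn instead of DIAMOND.
# The tblastn choice and the output file name are assumptions; run
#   phylophlan_write_config_file --help
# to check which values your installed version accepts.
write_tblastn_config() {
    cfg_out="$1"
    phylophlan_write_config_file \
        -o "$cfg_out" \
        -d a \
        --db_aa diamond \
        --map_dna tblastn \
        --map_aa diamond \
        --msa mafft \
        --trim trimal \
        --tree1 iqtree \
        --verbose
}

# Example (hypothetical file name); pass the result to phylophlan with -f:
# write_tblastn_config supermatrix_aa_tblastn.cfg
```

A freshly generated file will most likely carry the script's default parameters, so any hand-tuned settings from the existing config (for example the iqtree `params` line) would need to be copied over manually. Timing a single genome first would show whether the switch actually helps.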