What is the approximate running time per sample, and how can I reduce it?

I am using HUMAnN 3.9 and MetaPhlAn 4.1 for fastq file processing. Is it normal for a fastq.gz file of about 5 GB to take close to 12 hours to process? Moreover, I have 2,000 such samples; how can I reduce the overall running time of my project? Here are the parameters:
humann --input /public/home/CXZX03/perl5/3_tasks/1_metaphlan/demo/AA0001.fq.gz --threads 24 --search-mode uniref90 --remove-temp-output --nucleotide-database /public/home/CXZX03/perl5/2_data_base/humann/chocophlan --protein-database /public/home/CXZX03/perl5/2_data_base/humann/uniref_90 --output /public/home/CXZX03/perl5/3_tasks/2_humann/output/tmp --metaphlan-options="--bowtie2db /public/home/CXZX03/perl5/2_data_base/Metaphlan4/vJun23"

For comparison, a 10M-read metagenome sample I use for testing takes about 40 CPU hours in pure translated-search mode. The tiered workflow takes 16 CPU hours, and bypassing translated search (so HUMAnN stops after nucleotide search) takes 3 CPU hours.
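If your samples come from a reasonably well-characterized environment, skipping translated search is the single biggest saving. A minimal sketch, reusing the paths from your command and assuming HUMAnN's --bypass-translated-search option (confirm the exact flag name with `humann --help` for your version):

```bash
# Sketch: skip the translated (DIAMOND) search phase entirely, so only reads
# that align to pangenomes of MetaPhlAn-detected species are quantified.
humann \
  --input /public/home/CXZX03/perl5/3_tasks/1_metaphlan/demo/AA0001.fq.gz \
  --threads 8 \
  --bypass-translated-search \
  --remove-temp-output \
  --nucleotide-database /public/home/CXZX03/perl5/2_data_base/humann/chocophlan \
  --output /public/home/CXZX03/perl5/3_tasks/2_humann/output/tmp \
  --metaphlan-options="--bowtie2db /public/home/CXZX03/perl5/2_data_base/Metaphlan4/vJun23"
```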

12 hours on 24 threads is 288 CPU hours, which would only make sense if 1) your sample was much larger than mine and 2) it was highly uncharacterized (such that most of the work is being done in the translated search phase). Are either or both of those statements true?
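To check the first point, you can count the reads in the gzipped FASTQ directly; a quick sketch:

```bash
# Rough read count for a gzipped FASTQ: four lines per read.
echo $(( $(zcat /public/home/CXZX03/perl5/3_tasks/1_metaphlan/demo/AA0001.fq.gz | wc -l) / 4 ))
```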

Incidentally, when multithreading read-mapping tools I tend to max out at 8 threads, since in my experience the performance improvement is highly sublinear in the number of threads used. The stats I cited above were based on a run with 8 threads.
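Given 2,000 samples, one way to use your 24 cores more effectively is to run several 8-thread jobs side by side instead of a single 24-thread job. A hypothetical batching sketch (the directory names are placeholders; a cluster job array, e.g. SLURM, would serve the same purpose and scales better across nodes):

```bash
#!/usr/bin/env bash
# Hypothetical sketch: run HUMAnN on many samples, three 8-thread jobs at a
# time (3 x 8 = 24 cores). Adjust MAXJOBS and paths to your setup.
set -euo pipefail

INDIR=/public/home/CXZX03/perl5/3_tasks/1_metaphlan/demo
OUTDIR=/public/home/CXZX03/perl5/3_tasks/2_humann/output
MAXJOBS=3

for fq in "$INDIR"/*.fq.gz; do
  humann --input "$fq" --threads 8 --search-mode uniref90 --remove-temp-output \
    --nucleotide-database /public/home/CXZX03/perl5/2_data_base/humann/chocophlan \
    --protein-database /public/home/CXZX03/perl5/2_data_base/humann/uniref_90 \
    --output "$OUTDIR/$(basename "$fq" .fq.gz)" \
    --metaphlan-options="--bowtie2db /public/home/CXZX03/perl5/2_data_base/Metaphlan4/vJun23" &
  # Keep at most MAXJOBS HUMAnN processes running at once.
  while (( $(jobs -rp | wc -l) >= MAXJOBS )); do
    wait -n
  done
done
wait
```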
