What is humann doing in the circled area?

I found humann takes a low rate usage of CPU in the circled time range. However, It’s not easy for me to find the reason in a short time. I wish I could know what mission was HUMAnN doing in the circled time range. And by which method can I solve it.
Thanks. :blush:
Command we used:

time humann -i xxx.fq -o xxx --threads x --o-log xx.log --nucleotide-database xxx/chocophlan --bypass-translated-search --output-basename xxx --remove-column-description-output --memory-use minimum
1>/dev/null 2>/dev/null

The status statics of server:

This is an interesting analysis. :slight_smile: To know for sure you could compare the timestamps in the HUMAnN log file to the resource utilization you’re seeing here. My guess is that the circled phase is the post-processing of the nucleotide search results (with the actual nucleotide search being the CPU-heavy phase before it and the translated search as the CPU-heavy phase that follows). Please let us know what you find!

It’s nice, I found that log:

09/06/2022 12:30:09 AM - humann.humann - INFO: TIMESTAMP: Completed     custom database creation        :        19      seconds
09/06/2022 12:30:09 AM - humann.search.nucleotide - INFO: Running bowtie2-build ........
09/06/2022 12:30:09 AM - humann.utilities - DEBUG: Using software: /share/app/miniconda3/bin/bowtie2-build
09/06/2022 12:30:09 AM - humann.utilities - INFO: Execute command: /share/app/miniconda3/bin/bowtie2-build -f /output/single_sample/C30/C30_humann_temp_cw8dtgnf/C30_custom_chocophlan_database.ffn /output/single_sample/C30/C30_humann_temp_cw8
dtgnf/C30_bowtie2_index
09/06/2022 12:40:02 AM - humann.humann - INFO: TIMESTAMP: Completed     database index  :        593     seconds
09/06/2022 12:40:02 AM - humann.search.nucleotide - DEBUG: Nucleotide input file is of type: fastq
09/06/2022 12:40:02 AM - humann.utilities - DEBUG: Using software: /share/app/miniconda3/bin/bowtie2
09/06/2022 12:40:02 AM - humann.utilities - INFO: Execute command: /share/app/miniconda3/bin/bowtie2 -q -x /output/single_sample/C30/C30_humann_temp_cw8dtgnf/C30_bowtie2_index -U /output/single_sample/C30/C30.fq -S /output/single_sample/C30/
C30_humann_temp_cw8dtgnf/C30_bowtie2_aligned.sam -p 32 --very-sensitive
09/06/2022 01:28:26 AM - humann.utilities - DEBUG: b'65289099 reads; of these:\n  65289099 (100.00%) were unpaired; of these:\n    25431901 (38.95%) aligned 0 times\n    18797478 (28.79%) aligned exactly 1 time\n    21059720 (32.26%) aligned
 >1 times\n61.05% overall alignment rate\n'
09/06/2022 01:28:26 AM - humann.humann - INFO: TIMESTAMP: Completed     nucleotide alignment    :        2904    seconds
09/06/2022 02:06:05 AM - humann.utilities - DEBUG: Total alignments where percent identity is not a number: 0
09/06/2022 02:06:05 AM - humann.utilities - DEBUG: Total alignments where alignment length is not a number: 0
09/06/2022 02:06:05 AM - humann.utilities - DEBUG: Total alignments where E-value is not a number: 0
09/06/2022 02:06:05 AM - humann.utilities - DEBUG: Total alignments not included based on large e-value: 0
09/06/2022 02:06:05 AM - humann.utilities - DEBUG: Total alignments not included based on small percent identity: 0
09/06/2022 02:06:05 AM - humann.utilities - DEBUG: Total alignments not included based on small query coverage: 0
09/06/2022 02:06:05 AM - humann.search.blastx_coverage - INFO: Total alignments without coverage information: 0
09/06/2022 02:06:05 AM - humann.search.blastx_coverage - INFO: Total proteins in blastx output: 239711
09/06/2022 02:06:05 AM - humann.search.blastx_coverage - INFO: Total proteins without lengths: 0
09/06/2022 02:06:05 AM - humann.search.blastx_coverage - INFO: Proteins with coverage greater than threshold (50.0): 203594
09/06/2022 02:39:47 AM - humann.search.nucleotide - DEBUG: Total nucleotide alignments not included based on filtered genes: 285273
09/06/2022 02:39:47 AM - humann.search.nucleotide - DEBUG: Total nucleotide alignments not included based on small percent identities: 14
09/06/2022 02:39:47 AM - humann.search.nucleotide - DEBUG: Total nucleotide alignments not included based on query coverage threshold: 0
09/06/2022 02:39:47 AM - humann.search.nucleotide - DEBUG: Keeping sam file
09/06/2022 02:39:51 AM - humann.humann - INFO: TIMESTAMP: Completed     nucleotide alignment post-processing    :        4285    seconds
09/06/2022 02:39:51 AM - humann.humann - INFO: Total bugs from nucleotide alignment: 75
09/06/2022 02:39:51 AM - humann.humann - INFO: 
g__Blautia.s__Blautia_obeum: 1207370 hits
g__Ruminococcus.s__Ruminococcus_bromii: 854761 hits
g__Bacteroides.s__Bacteroides_stercoris: 4037848 hits

Hi, franzosa:
I found one thing strange, which is:
If you replace “re.search(‘^@’, xxx)” with “xxx.startswith(‘@’)” in the following places(and not only these places), the time usage would be reduced a lot.

  • humann/utilities.py: in about line 1307;
  • humann/search/nucletide.py: in about line 260 and 309.

In my thought, it would reduce a small piece of time. However, in the tests, the time were reduced from 748.8 s to 263 s in these lines. The command I used:time humann -i abcde_bowtie2_aligned.sam --input-format sam -o humann_out_bcdef --taxonomic-profile metaphlan_out_cdefg.tsv --bypass-translated-search --nucleotide-database /database/humann/chocophlan
The input SAM file was about 490 Mb.

The 3 results, with checking by lines, were all the same but the first annotate lines.
So I wanna know whether is this method feasible?
Thank you very much.

Sorry, it’s my mistake. The size of raw file with suffix “fq.gz” is 490 Mb, not the SAM file.

Nice speedup. I’ll add it to our development list!

Hi @franzosa ,
I think slice would be faster as function calling takes more time in Python.
It could be used like: a[0] == '@'.
Or you could try just add two optional arguments “–no-head --no-seq” to bowtie2 and then there would be no “@”.

Nice ideas - I’ve added them to our dev list! :slight_smile: