HUMAnN3 metatranscriptome analysis stuck at nucleotide alignment post-processing

Hi All,

I am having trouble getting through a metatranscriptomic analysis with HUMAnN3. I am running it on my university's computing cluster, which uses the SLURM job manager. I am using paired-end sequencing data that was previously cleaned with KneadData (without any errors). I then concatenated the cleaned forward and reverse FASTQ files for each sample, per the instructions in your manual, to make a single file per sample containing the F+R reads.
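For reference, the concatenation itself was just appending the reverse reads after the forward reads for each sample, along the lines of this sketch (the sample list and file-name pattern here are placeholders for my own naming scheme):

```python
# Sketch: build one combined F+R FASTQ per sample by appending R2 after R1.
# Sample IDs and file names are placeholders; adjust to your own scheme.
import shutil

samples = ["RNA_17"]  # one entry per sample
for s in samples:
    with open(f"{s}_cleanFR.fastq", "wb") as combined:
        for mate in (f"{s}_clean_1.fastq", f"{s}_clean_2.fastq"):
            with open(mate, "rb") as handle:
                shutil.copyfileobj(handle, combined)
```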

For about 2/3 of my samples the analysis completed successfully; however, the other 1/3 never make it past the nucleotide alignment step (based on the log files; an example is attached). They seem to complete the nucleotide alignment itself without issue, usually within a few hours, but then the run sits for 10+ days (as long as I've let it) with apparently nothing happening: no updates to the log file, no change in the size of the aligned-reads file, and no other files (such as unaligned reads) created. Eventually the jobs simply time out with no apparent progress after the nucleotide alignment step.

I have tried looking back for issues with these particular input files (no errors in the KneadData cleanup step, as mentioned), and I tried re-concatenating the F+R files, but so far I can't find anything wrong with the input files themselves. The issue does seem related to file size, as the failed samples tend to be among those with higher read counts; however, the correlation is not perfect, since some samples with equivalent read counts completed successfully. My files are rather large, with those on the higher end at ~80-100 million reads.

Any ideas/help would be greatly appreciated!
Thanks,
Chapman

Log file:
```
03/24/2022 04:20:54 PM - humann.humann - INFO: Running humann v3.0.1
03/24/2022 04:20:54 PM - humann.humann - INFO: Output files will be written to: /users/cbeekman/scratch/GImapping/metaRNA/humann3/humann3_outfiles/RNA_17
03/24/2022 04:20:54 PM - humann.humann - INFO: Writing temp files to directory: /users/cbeekman/scratch/GImapping/metaRNA/humann3/humann3_outfiles/RNA_17/RNA_17_cleanFR_humann_temp
03/24/2022 04:20:54 PM - humann.utilities - INFO: File ( /users/cbeekman/scratch/GImapping/metaRNA/humann3/concat_cleanreads/RNA_17_cleanFR.fastq ) is of format: fastq
03/24/2022 04:20:54 PM - humann.utilities - DEBUG: Check software, metaphlan, for required version, 3.0
03/24/2022 04:20:56 PM - humann.utilities - INFO: Using metaphlan version 3.0
03/24/2022 04:20:56 PM - humann.utilities - DEBUG: Check software, bowtie2, for required version, 2.2
03/24/2022 04:20:56 PM - humann.utilities - INFO: Using bowtie2 version 2.4
03/24/2022 04:20:56 PM - humann.humann - INFO: Search mode set to uniref90 because a uniref90 translated search database is selected
03/24/2022 04:20:56 PM - humann.utilities - DEBUG: Check software, diamond, for required version, 0.9.36
03/24/2022 04:20:56 PM - humann.utilities - INFO: Using diamond version 2.0.14
03/24/2022 04:20:56 PM - humann.config - INFO:
Run config settings:

DATABASE SETTINGS
nucleotide database folder = /users/cbeekman/scratch/GImapping/databases/humann3/chocophlan
protein database folder = /users/cbeekman/scratch/GImapping/databases/humann3/uniref
pathways database file 1 = /users/cbeekman/miniconda2/envs/humann3/lib/python3.9/site-packages/humann/data/pathways/metacyc_reactions_level4ec_only.uniref.bz2
pathways database file 2 = /users/cbeekman/miniconda2/envs/humann3/lib/python3.9/site-packages/humann/data/pathways/metacyc_pathways_structured_filtered_v24
utility mapping database folder = /users/cbeekman/miniconda2/envs/humann3/lib/python3.9/site-packages/humann/data/misc

RUN MODES
resume = False
verbose = False
bypass prescreen = False
bypass nucleotide index = False
bypass nucleotide search = False
bypass translated search = False
translated search = diamond
threads = 48

SEARCH MODE
search mode = uniref90
nucleotide identity threshold = 0.0
translated identity threshold = 80.0

ALIGNMENT SETTINGS
bowtie2 options = --very-sensitive
diamond options = --top 1 --outfmt 6
evalue threshold = 1.0
prescreen threshold = 0.01
translated subject coverage threshold = 50.0
translated query coverage threshold = 90.0
nucleotide subject coverage threshold = 50.0
nucleotide query coverage threshold = 90.0

PATHWAYS SETTINGS
minpath = on
xipe = off
gap fill = on

INPUT AND OUTPUT FORMATS
input file format = fastq
output file format = tsv
output max decimals = 10
remove stratified output = False
remove column description output = False
log level = DEBUG

03/24/2022 04:20:56 PM - humann.store - DEBUG: Initialize Alignments class instance to maximize memory use
03/24/2022 04:20:56 PM - humann.store - DEBUG: Initialize Reads class instance to maximize memory use
03/24/2022 04:21:13 PM - humann.humann - INFO: Load pathways database part 1: /users/cbeekman/miniconda2/envs/humann3/lib/python3.9/site-packages/humann/data/pathways/metacyc_reactions_level4ec_only.uniref.bz2
03/24/2022 04:21:13 PM - humann.humann - INFO: Load pathways database part 2: /users/cbeekman/miniconda2/envs/humann3/lib/python3.9/site-packages/humann/data/pathways/metacyc_pathways_structured_filtered_v24
03/24/2022 04:21:13 PM - humann.search.prescreen - INFO: Found g__Klebsiella.s__Klebsiella_pneumoniae : 54.76% of mapped reads
03/24/2022 04:21:13 PM - humann.search.prescreen - INFO: Found g__Parasutterella.s__Parasutterella_excrementihominis : 23.03% of mapped reads
03/24/2022 04:21:13 PM - humann.search.prescreen - INFO: Found g__Klebsiella.s__Klebsiella_variicola : 18.80% of mapped reads
03/24/2022 04:21:13 PM - humann.search.prescreen - INFO: Found g__Klebsiella.s__Klebsiella_quasipneumoniae : 1.29% of mapped reads
03/24/2022 04:21:13 PM - humann.search.prescreen - INFO: Found g__Proteobacteria_unclassified.s__Proteobacteria_bacterium_CAG_139 : 1.01% of mapped reads
03/24/2022 04:21:13 PM - humann.search.prescreen - INFO: Found g__Klebsiella.s__Klebsiella_michiganensis : 0.86% of mapped reads
03/24/2022 04:21:13 PM - humann.search.prescreen - INFO: Found g__Turicimonas.s__Turicimonas_muris : 0.23% of mapped reads
03/24/2022 04:21:13 PM - humann.search.prescreen - INFO: Found g__Lactobacillus.s__Lactobacillus_johnsonii : 0.02% of mapped reads
03/24/2022 04:21:13 PM - humann.search.prescreen - INFO: Total species selected from prescreen: 8
03/24/2022 04:21:13 PM - humann.search.prescreen - DEBUG: Adding file to database: g__Parasutterella.s__Parasutterella_excrementihominis.centroids.v296_v201901b.ffn.gz
03/24/2022 04:21:13 PM - humann.search.prescreen - DEBUG: Adding file to database: g__Klebsiella.s__Klebsiella_quasipneumoniae.centroids.v296_v201901b.ffn.gz
03/24/2022 04:21:13 PM - humann.search.prescreen - DEBUG: Adding file to database: g__Klebsiella.s__Klebsiella_michiganensis.centroids.v296_v201901b.ffn.gz
03/24/2022 04:21:13 PM - humann.search.prescreen - DEBUG: Adding file to database: g__Turicimonas.s__Turicimonas_muris.centroids.v201901b.ffn.gz
03/24/2022 04:21:13 PM - humann.search.prescreen - DEBUG: Adding file to database: g__Klebsiella.s__Klebsiella_pneumoniae.centroids.v296_v201901b.ffn.gz
03/24/2022 04:21:13 PM - humann.search.prescreen - DEBUG: Adding file to database: g__Proteobacteria_unclassified.s__Proteobacteria_bacterium_CAG_139.centroids.v296_v201901b.ffn.gz
03/24/2022 04:21:13 PM - humann.search.prescreen - DEBUG: Adding file to database: g__Klebsiella.s__Klebsiella_variicola.centroids.v296_v201901b.ffn.gz
03/24/2022 04:21:13 PM - humann.search.prescreen - DEBUG: Adding file to database: g__Lactobacillus.s__Lactobacillus_johnsonii.centroids.v296_v201901b.ffn.gz
03/24/2022 04:21:13 PM - humann.search.prescreen - INFO: Creating custom ChocoPhlAn database …
03/24/2022 04:21:13 PM - humann.utilities - DEBUG: Using software: /usr/bin/gunzip
03/24/2022 04:21:13 PM - humann.utilities - INFO: Execute command: /usr/bin/gunzip -c /users/cbeekman/scratch/GImapping/databases/humann3/chocophlan/g__Parasutterella.s__Parasutterella_excrementihominis.centroids.v296_v201901b.ffn.gz /users/cbeekman/scratch/GImapping/databases/humann3/chocophlan/g__Klebsiella.s__Klebsiella_quasipneumoniae.centroids.v296_v201901b.ffn.gz /users/cbeekman/scratch/GImapping/databases/humann3/chocophlan/g__Klebsiella.s__Klebsiella_michiganensis.centroids.v296_v201901b.ffn.gz /users/cbeekman/scratch/GImapping/databases/humann3/chocophlan/g__Turicimonas.s__Turicimonas_muris.centroids.v201901b.ffn.gz /users/cbeekman/scratch/GImapping/databases/humann3/chocophlan/g__Klebsiella.s__Klebsiella_pneumoniae.centroids.v296_v201901b.ffn.gz /users/cbeekman/scratch/GImapping/databases/humann3/chocophlan/g__Proteobacteria_unclassified.s__Proteobacteria_bacterium_CAG_139.centroids.v296_v201901b.ffn.gz /users/cbeekman/scratch/GImapping/databases/humann3/chocophlan/g__Klebsiella.s__Klebsiella_variicola.centroids.v296_v201901b.ffn.gz /users/cbeekman/scratch/GImapping/databases/humann3/chocophlan/g__Lactobacillus.s__Lactobacillus_johnsonii.centroids.v296_v201901b.ffn.gz
03/24/2022 04:21:14 PM - humann.humann - INFO: TIMESTAMP: Completed custom database creation : 1 seconds
03/24/2022 04:21:14 PM - humann.search.nucleotide - INFO: Running bowtie2-build …
03/24/2022 04:21:14 PM - humann.utilities - DEBUG: Using software: /users/cbeekman/miniconda2/envs/humann3/bin/bowtie2-build
03/24/2022 04:21:14 PM - humann.utilities - INFO: Execute command: /users/cbeekman/miniconda2/envs/humann3/bin/bowtie2-build -f /users/cbeekman/scratch/GImapping/metaRNA/humann3/humann3_outfiles/RNA_17/RNA_17_cleanFR_humann_temp/RNA_17_cleanFR_custom_chocophlan_database.ffn /users/cbeekman/scratch/GImapping/metaRNA/humann3/humann3_outfiles/RNA_17/RNA_17_cleanFR_humann_temp/RNA_17_cleanFR_bowtie2_index
03/24/2022 04:24:26 PM - humann.humann - INFO: TIMESTAMP: Completed database index : 192 seconds
03/24/2022 04:24:26 PM - humann.search.nucleotide - DEBUG: Nucleotide input file is of type: fastq
03/24/2022 04:24:26 PM - humann.utilities - DEBUG: Using software: /users/cbeekman/miniconda2/envs/humann3/bin/bowtie2
03/24/2022 04:24:26 PM - humann.utilities - INFO: Execute command: /users/cbeekman/miniconda2/envs/humann3/bin/bowtie2 -q -x /users/cbeekman/scratch/GImapping/metaRNA/humann3/humann3_outfiles/RNA_17/RNA_17_cleanFR_humann_temp/RNA_17_cleanFR_bowtie2_index -U /users/cbeekman/scratch/GImapping/metaRNA/humann3/concat_cleanreads/RNA_17_cleanFR.fastq -S /users/cbeekman/scratch/GImapping/metaRNA/humann3/humann3_outfiles/RNA_17/RNA_17_cleanFR_humann_temp/RNA_17_cleanFR_bowtie2_aligned.sam -p 48 --very-sensitive
03/24/2022 05:04:18 PM - humann.utilities - DEBUG: b"Warning: skipping read 'A00261:479:H7VMHDSX3:3:1152:28031:31501:N:0:CAAGGTAC+GAGATACG#0' because length (1) <= # seed mismatches (0)\nWarning: skipping read 'A00261:479:H7VMHDSX3:3:1152:28031:31501:N:0:CAAGGTAC+GAGATACG#0' because it was < 2 characters long\nWarning: skipping read 'A00261:479:H7VMHDSX3:3:2463:28230:8296:N:0:CAAGGTAC+GAGATACG#0' because length (1) <= # seed mismatches (0)\nWarning: skipping read 'A00261:479:H7VMHDSX3:3:2463:28230:8296:N:0:CAAGGTAC+GAGATACG#0' because it was < 2 characters long\nWarning: skipping read 'A00261:479:H7VMHDSX3:3:1152:28031:31501:N:0:CAAGGTAC+GAGATACG#0' because length (1) <= # seed mismatches (0)\nWarning: skipping read 'A00261:479:H7VMHDSX3:3:1152:28031:31501:N:0:CAAGGTAC+GAGATACG#0' because it was < 2 characters long\nWarning: skipping read 'A00261:479:H7VMHDSX3:3:2463:28230:8296:N:0:CAAGGTAC+GAGATACG#0' because length (1) <= # seed mismatches (0)\nWarning: skipping read 'A00261:479:H7VMHDSX3:3:2463:28230:8296:N:0:CAAGGTAC+GAGATACG#0' because it was < 2 characters long\n185480202 reads; of these:\n 185480202 (100.00%) were unpaired; of these:\n 96382098 (51.96%) aligned 0 times\n 14695505 (7.92%) aligned exactly 1 time\n 74402599 (40.11%) aligned >1 times\n48.04% overall alignment rate\n"
03/24/2022 05:04:18 PM - humann.humann - INFO: TIMESTAMP: Completed nucleotide alignment : 2392 seconds
```

Hi, Thank you for the detailed post. It sounds like you might be running out of memory for the jobs that failed. Can you try increasing the memory allocated to those jobs to see if it resolves the issue?

Thank you,
Lauren

Hi Lauren,

Thanks for the response. I am already maxing out the memory per run on our standard partition. However, our cluster does have a large-memory partition that allows a bit more, so I will try running there with more memory requested.
That said, the SLURM output file would usually make some reference to memory if a job ran out, and these simply say the jobs timed out. Are there any other possibilities that come to mind?

Hi Lauren,

I have tried running my failed jobs with more memory but am still having the same issue. Also, looking at the stats on the jobs, they don't seem to be using anywhere near the upper limit of memory (or CPUs).

Looking at a log file from one of my successful runs, it looks like the next things to run are humann.utilities and humann.search.blastx_coverage. Does this involve a BLASTX search? It seems like memory and CPU resources are not the issue, but that for some reason these next steps are just taking excruciatingly long to execute.

Could you help me understand what should be happening after the nucleotide alignment for any clues that might help me speed things up?

I have run into a similar problem, except that my HUMAnN3 (v3.6) is stuck at the protein alignment post-processing step.

Here is the most recent part of the log:

```
12/12/2022 11:19:02 PM - humann.utilities - DEBUG: Using software: /bin/cat
12/12/2022 11:19:02 PM - humann.utilities - INFO: Execute command: /bin/cat /dev/shm/M44B/M44B_humann_temp/tmpsglftnkt/diamond_m8_jfi1qr7t
12/12/2022 11:19:11 PM - humann.humann - INFO: TIMESTAMP: Completed translated alignment : 12811 seconds
```

Threads used: 30
Memory available: 1800G
/dev/shm space available: 1000G

I submitted a HUMAnN3 job for each of my 134 samples. Most of them succeeded within 6 hours, but 4 of them got stuck at the same point, and they have now been stuck for 2 days.

It does not seem to be an I/O problem, as the humann process is still running and the CPU usage is at 100%.

Any clues? Thanks, and best wishes!

Wondering if this thread could be relevant, as it discusses the same step where my samples are getting stuck: What is humann doing in the circled area?

The initial log here suggests that the nucleotide search is finishing, but I don't see any indication that the translated search has begun. I see from the log that you're running DIAMOND 2.0.14, which isn't a version we've specifically tested against. You might want to try running with the latest v3 versions of MetaPhlAn (3.1) + HUMAnN (3.6), both to resolve this small inconsistency and to take advantage of other updates we've made to the software recently.

Try debugging in the source code.
Firstly, find the humann install path:

```python
import humann
print(humann.__file__)
```

Secondly, add some print statements, for example in humann/search/nucleotide.py around lines ~290 and ~380, where humann reads the alignment file and processes it SAM line by SAM line (see the sketch after these steps):

```python
print('done file read')
```

Thirdly, run humann and watch for the output.
Finally, keep checking like this until you find out where the program is stuck.
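Putting those steps together, the added instrumentation could look roughly like this. It is only a sketch: the actual loop, variable names, and line numbers will differ between humann versions, so adapt it to the code you find at the locations above.

```python
# Illustrative progress print inside the SAM-processing loop of
# humann/search/nucleotide.py. "file_handle" and the loop structure are
# assumptions about the surrounding code, not the literal source.
lineno = 0
for line in file_handle:
    lineno += 1
    if lineno % 100000 == 0:
        # flush so the message shows up in the log/stdout immediately
        print(f"lineno = {lineno}", flush=True)
    # ... existing per-line processing of the SAM fields continues here ...
```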

Hi Eric,

Thanks for your response. I tried a test run after making a new environment and installing the update (HUMAnN 3.6), but I still see the same issue. I had also brought this to the attention of the help team for our computing cluster; they did some investigating and found the following, which they thought was strange:

From the admin for our computing cluster:

"Out of curiosity, I traced the program execution and found the issue is the function call
allowed_genes = blastx_coverage.blastx_coverage(reduced_aligned_reads_file,
config.nucleotide_subject_coverage_threshold, alignments, log_messages=True, apply_filter=True,
nucleotide=True, query_coverage_threshold=config.nucleotide_query_coverage_threshold,
identity_threshold = config.nucleotide_identity_threshold)

This is called by the unaligned_reads function in search/nucleotide.py, which is in turn called by the main function in humann.py.

The problem is that the time to process each 100,000 lines of RNA_17_cleanFR_bowtie2_aligned.tsv keeps increasing: from about 20 seconds when the first lines were processed to roughly 9 minutes per 100,000 lines by the time line 35,500,000 was processed.
```
12/20/2022 09:44:25 AM - humann.utilities - INFO: lineno = 100000
12/20/2022 09:44:27 AM - humann.utilities - INFO: lineno = 200000
12/20/2022 09:44:29 AM - humann.utilities - INFO: lineno = 300000
12/20/2022 09:44:32 AM - humann.utilities - INFO: lineno = 400000
12/20/2022 09:44:36 AM - humann.utilities - INFO: lineno = 500000

12/20/2022 10:58:00 AM - humann.utilities - INFO: lineno = 10000000
12/20/2022 10:59:37 AM - humann.utilities - INFO: lineno = 10100000
12/20/2022 11:01:18 AM - humann.utilities - INFO: lineno = 10200000
12/20/2022 11:03:03 AM - humann.utilities - INFO: lineno = 10300000
12/20/2022 11:04:50 AM - humann.utilities - INFO: lineno = 10400000
12/20/2022 11:06:40 AM - humann.utilities - INFO: lineno = 10500000
12/20/2022 11:08:31 AM - humann.utilities - INFO: lineno = 10600000

12/21/2022 08:45:06 AM - humann.utilities - INFO: lineno = 35300000
12/21/2022 08:54:12 AM - humann.utilities - INFO: lineno = 35400000
12/21/2022 09:03:19 AM - humann.utilities - INFO: lineno = 35500000
```

You may want to contact the developers about why this happens."

At this point I guess I just want to know whether you think something unexpected is definitely going wrong here, or whether the extremely long processing time (more than 2 weeks) is to be expected given the size of my input FASTQ files (the failed samples have concatenated F+R read files of 50-70 GB and 100 million+ reads). I know these are large files, but I don't think such depth is unheard of for metatranscriptomic analyses. Would you expect HUMAnN to take weeks or more to process such samples? Or, more specifically, to conduct the nucleotide alignment post-processing, since that is where the runs get stuck?

Hi, thank you for trying an increase in memory. How much memory was the process using vs. how much was it allocated? Were other processes running on the same node? How much unused memory was available on the node? In the cases where I have seen this in the past, it has been a process swapping because it did not have enough memory.
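If it is easier than pulling scheduler stats, you can also watch the live memory of the running humann process directly. Here is a rough sketch using the psutil package (the PID is a placeholder for whatever ps reports for your job):

```python
# Sketch: poll the resident memory (RSS) of a running process once a minute.
# Requires psutil; replace 12345 with the actual humann PID on the node.
import time
import psutil

proc = psutil.Process(12345)
while proc.is_running():
    rss_gb = proc.memory_info().rss / 1e9
    print(f"RSS: {rss_gb:.1f} GB", flush=True)
    time.sleep(60)
```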

Also, I would compare the total number of alignments for the samples that ran quickly vs. those that took longer to run. The information can be found in the log in a line that looks like the following:

```
185480202 reads; of these:
  185480202 (100.00%) were unpaired; of these:
    96382098 (51.96%) aligned 0 times
    14695505 (7.92%) aligned exactly 1 time
    74402599 (40.11%) aligned >1 times
48.04% overall alignment rate
```

If they have approximately the same number of aligned reads, check to see if some samples have more multiple alignments per read.
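For example, a quick scan over the HUMAnN log files can pull those counts out of the bowtie2 summary for each sample. This is only a sketch; the glob pattern and log locations are assumptions you would adapt to your setup:

```python
# Sketch: extract per-sample bowtie2 alignment counts from HUMAnN logs.
# The path pattern below is an assumption; point it at your own log files.
import glob
import re

for log_path in sorted(glob.glob("humann3_outfiles/*/*.log")):
    text = open(log_path).read()
    once = re.search(r"(\d+) \([\d.]+%\) aligned exactly 1 time", text)
    multi = re.search(r"(\d+) \([\d.]+%\) aligned >1 times", text)
    if once and multi:
        print(f"{log_path}: unique={once.group(1)} multi={multi.group(1)}")
```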

Thank you!
Lauren

Hi Lauren,

Thanks for the reply. When I ran the seff command on the job, it indicated that the job (for one of the timed-out samples I have been using for troubleshooting) used 138 GB and was allocated 375 GB. I don't believe anything else would have been running on the node, as 385 GB is the most memory any of our nodes has and I requested 375.

I don't think I have the alignment files anymore for the rest of them, as I removed intermediate files due to limited storage availability. However, I did let the sample I have been troubleshooting run longer, and it did actually finish after ~12 days. I think I am just going to have to accept that and run the timed-out samples with as much time allocated as possible. Unfortunately, this may take months, as the memory limits may prevent me from running more than a few jobs at once.

I'm not sure if this is just the reality for deeper sequencing experiments, but if you do find a way to speed up the nucleotide post-processing step, please let me know.

-Thanks

Thank you for the follow-up. Is there anything different between this sample and the others that ran much faster? For example, is the FASTQ file much larger? Do more reads pass to the translated search? Are there a lot more translated search alignments?

Thank you,
Lauren

Hi Lauren,

I was able to get my samples to finish by simply giving them more time (20+ days for a couple of them). Again, the bottleneck for all samples was the nucleotide post-processing step that occurs immediately after the nucleotide alignment and before the translated search.

There does seem to be a relationship with the overall number of alignments. I have actually tried running my samples with the standard workflow as well as with a custom nucleotide database, and the same issue occurred with both workflows. However, looking back at the runs with the standard workflow, all those that got stuck at the nucleotide post-processing step seemed to have much higher alignment rates in the nucleotide alignment.

For the custom database this relationship wasn't as clear, but I have a feeling that is because the reads that aligned multiple times were aligning to MANY genes in the database (the alignment files tended to be very large, though again I had to delete the intermediate files as the runs completed to free up storage, so I can't be sure there is a direct relationship).

Please let me know if you figure out what is happening and a way to speed this step up for future analyses.

Thanks,
Chapman

Hi Chapman, thank you for the follow-up. Yes, I think that totally makes sense, and it is what I was guessing might be happening. If you have reads with multiple alignments and those alignments are to many genes, the alignment database grows and the processing takes much longer than in other runs. As a rough illustration, your log shows ~74 million reads aligning more than once; if each of those hits even 10 genes, that is on the order of 740 million alignment records to store and filter. We don't see these cases often, but when we do, the runs take a lot of time. I have put this on our development list to see if we can speed these cases up.

Thanks again for the feedback,
Lauren