Bowtie2 unaligned reads slow

Hello,

I would like to use humann3 on RNA-seq data from a defined microbial community, and I am a bit concerned about how long it is taking to run and whether I am doing something wrong or something could be improved. The run seems to be bottlenecked at the nucleotide alignment post-processing step, where bowtie2 unaligned reads are written to a .fa file for use in the translated search step. After >12 hours there is still no end in sight (though the unaligned.fa file is continuously updated over this period).

My input command is below (I have 8 files, which contain the concatenated F and R reads, each file is ~3-5 GB):
for f in *.fastqsanger; do
    humann -i "$f" -o results-batch \
        --taxonomic-profile 14SM_abundance.tsv \
        --bypass-translated-search \
        --remove-temp-output --resume --verbose
done > out.log 2> out.err

I skipped metaphlan and used a taxonomic profile based on 16S sequencing results (I used the average profile for all samples, since the profile is only used to build the custom chocophlan database; from what I can tell, the actual abundances are not used by humann). During nucleotide alignment, 75% of reads are mapped overall (50% uniquely, 25% >1 time), which to my knowledge is reasonable, so it should be fine to proceed. Since the 14 bacteria are well-characterised, I opted to use the --bypass-translated-search option to reduce the run time, so the unaligned file is not even that important, except to confirm that I am not losing reads that should map to the 14 bacteria. I did confirm that the reads in this file have no hits in blastn.

Is there any way to speed this process up? The log below is from when I ran with 2 cores yesterday. I have since increased my requested number of cores to 8; so far there's not much improvement, and if I understand correctly, the step that is taking so long (writing unaligned reads to the file) is not parallelized, so this should not have much of an effect.

Please let me know if I can provide any more details. Thanks in advance for your input!

08/04/2021 10:19:17 AM - humann.humann - INFO: Running humann v3.0.0
08/04/2021 10:19:17 AM - humann.humann - INFO: Output files will be written to: /mnt/std-pool/homedirs/egrant/Mareike/RNASeq/Interlacer/results-batch
08/04/2021 10:19:17 AM - humann.humann - INFO: Writing temp files to directory: /mnt/std-pool/homedirs/egrant/Mareike/RNASeq/Interlacer/results-batch/10-322-14SM_humann_temp_w_quja3t
08/04/2021 10:19:17 AM - humann.utilities - INFO: File ( /mnt/std-pool/homedirs/egrant/Mareike/RNASeq/Interlacer/10-322-14SM.fastqsanger ) is of format: fastq
08/04/2021 10:19:17 AM - humann.humann - INFO: Removing spaces from identifiers in input file
08/04/2021 10:20:12 AM - humann.utilities - DEBUG: Check software, metaphlan, for required version, 3.0
08/04/2021 10:20:13 AM - humann.utilities - INFO: Using metaphlan version 3.0
08/04/2021 10:20:13 AM - humann.utilities - DEBUG: Check software, bowtie2, for required version, 2.2
08/04/2021 10:20:13 AM - humann.utilities - INFO: Using bowtie2 version 2.4
08/04/2021 10:20:13 AM - humann.config - INFO:
Run config settings:

DATABASE SETTINGS
nucleotide database folder = /mnt/std-pool/homedirs/egrant/packages/db/chocophlan/
protein database folder = /mnt/std-pool/homedirs/egrant/packages/db/uniref/
pathways database file 1 = /mnt/pcpnfs/homedirs/egrant/anaconda3/envs/humann/lib/python3.7/site-packages/humann/data/pathways/metacyc_reactions_level4ec_only.uniref.bz2
pathways database file 2 = /mnt/pcpnfs/homedirs/egrant/anaconda3/envs/humann/lib/python3.7/site-packages/humann/data/pathways/metacyc_pathways_structured_filtered
utility mapping database folder = /mnt/pcpnfs/homedirs/egrant/anaconda3/envs/humann/lib/python3.7/site-packages/humann/data/misc

RUN MODES
resume = True
verbose = True
bypass prescreen = False
bypass nucleotide index = False
bypass nucleotide search = False
bypass translated search = True
translated search = diamond
threads = 20

SEARCH MODE
search mode = uniref90
nucleotide identity threshold = 0.0
translated identity threshold = 80.0

ALIGNMENT SETTINGS
bowtie2 options = --very-sensitive
diamond options = --top 1 --outfmt 6
evalue threshold = 1.0
prescreen threshold = 0.01
translated subject coverage threshold = 50.0
translated query coverage threshold = 90.0
nucleotide subject coverage threshold = 50.0
nucleotide query coverage threshold = 90.0

PATHWAYS SETTINGS
minpath = on
xipe = off
gap fill = on

INPUT AND OUTPUT FORMATS
input file format = fastq
output file format = tsv
output max decimals = 10
remove stratified output = False
remove column description output = False
log level = DEBUG

08/04/2021 10:20:13 AM - humann.store - DEBUG: Initialize Alignments class instance to minimize memory use
08/04/2021 10:20:13 AM - humann.store - DEBUG: Initialize Reads class instance to minimize memory use
08/04/2021 10:20:32 AM - humann.humann - INFO: Load pathways database part 1: /mnt/pcpnfs/homedirs/egrant/anaconda3/envs/humann/lib/python3.7/site-packages/humann/data/pathways/metacyc_reactions_level4ec_only.uniref.bz2
08/04/2021 10:20:32 AM - humann.humann - INFO: Load pathways database part 2: /mnt/pcpnfs/homedirs/egrant/anaconda3/envs/humann/lib/python3.7/site-packages/humann/data/pathways/metacyc_pathways_structured_filtered
08/04/2021 10:20:32 AM - humann.search.prescreen - INFO: Found g__Bacteroides.s__Bacteroides_ovatus : 20.00% of mapped reads
08/04/2021 10:20:32 AM - humann.search.prescreen - INFO: Found g__Bacteroides.s__Bacteroides_uniformis : 4.00% of mapped reads
08/04/2021 10:20:32 AM - humann.search.prescreen - INFO: Found g__Bacteroides.s__Bacteroides_thetaiotaomicron : 17.00% of mapped reads
08/04/2021 10:20:32 AM - humann.search.prescreen - INFO: Found g__Akkermansia.s__Akkermansia_muciniphila : 15.00% of mapped reads
08/04/2021 10:20:32 AM - humann.search.prescreen - INFO: Found g__Roseburia.s__Roseburia_intestinalis : 9.00% of mapped reads
08/04/2021 10:20:32 AM - humann.search.prescreen - INFO: Found g__Marvinbryantia.s__Marvinbryantia_formatexigens : 1.00% of mapped reads
08/04/2021 10:20:32 AM - humann.search.prescreen - INFO: Found g__Collinsella.s__Collinsella_aerofaciens : 1.00% of mapped reads
08/04/2021 10:20:32 AM - humann.search.prescreen - INFO: Found g__Bacteroides.s__Bacteroides_caccae : 12.00% of mapped reads
08/04/2021 10:20:32 AM - humann.search.prescreen - INFO: Found g__Barnesiella.s__Barnesiella_intestinihominis : 3.00% of mapped reads
08/04/2021 10:20:32 AM - humann.search.prescreen - INFO: Found g__Desulfovibrio.s__Desulfovibrio_piger : 1.00% of mapped reads
08/04/2021 10:20:32 AM - humann.search.prescreen - INFO: Found g__Lachnoclostridium.s__Clostridium_symbiosum : 1.00% of mapped reads
08/04/2021 10:20:32 AM - humann.search.prescreen - INFO: Found g__Lachnospiraceae_unclassified.s__Eubacterium_rectale : 7.00% of mapped reads
08/04/2021 10:20:32 AM - humann.search.prescreen - INFO: Found g__Escherichia.s__Escherichia_coli : 8.00% of mapped reads
08/04/2021 10:20:32 AM - humann.search.prescreen - INFO: Found g__Faecalibacterium.s__Faecalibacterium_prausnitzii : 1.00% of mapped reads
08/04/2021 10:20:32 AM - humann.search.prescreen - INFO: Total species selected from prescreen: 14
08/04/2021 10:20:32 AM - humann.search.prescreen - DEBUG: Adding file to database: g__Akkermansia.s__Akkermansia_muciniphila.centroids.v296_v201901b.ffn.gz
08/04/2021 10:20:32 AM - humann.search.prescreen - DEBUG: Adding file to database: g__Bacteroides.s__Bacteroides_caccae.centroids.v296_v201901b.ffn.gz
08/04/2021 10:20:32 AM - humann.search.prescreen - DEBUG: Adding file to database: g__Bacteroides.s__Bacteroides_ovatus.centroids.v296_v201901b.ffn.gz
08/04/2021 10:20:32 AM - humann.search.prescreen - DEBUG: Adding file to database: g__Bacteroides.s__Bacteroides_thetaiotaomicron.centroids.v296_v201901b.ffn.gz
08/04/2021 10:20:32 AM - humann.search.prescreen - DEBUG: Adding file to database: g__Bacteroides.s__Bacteroides_uniformis.centroids.v296_v201901b.ffn.gz
08/04/2021 10:20:32 AM - humann.search.prescreen - DEBUG: Adding file to database: g__Barnesiella.s__Barnesiella_intestinihominis.centroids.v296_v201901b.ffn.gz
08/04/2021 10:20:32 AM - humann.search.prescreen - DEBUG: Adding file to database: g__Collinsella.s__Collinsella_aerofaciens.centroids.v201901b.ffn.gz
08/04/2021 10:20:32 AM - humann.search.prescreen - DEBUG: Adding file to database: g__Desulfovibrio.s__Desulfovibrio_piger.centroids.v296_v201901b.ffn.gz
08/04/2021 10:20:32 AM - humann.search.prescreen - DEBUG: Adding file to database: g__Escherichia.s__Escherichia_coli.centroids.v296_v201901b.ffn.gz
08/04/2021 10:20:32 AM - humann.search.prescreen - DEBUG: Adding file to database: g__Faecalibacterium.s__Faecalibacterium_prausnitzii.centroids.v296_v201901b.ffn.gz
08/04/2021 10:20:32 AM - humann.search.prescreen - DEBUG: Adding file to database: g__Lachnoclostridium.s__Clostridium_symbiosum.centroids.v296_v201901b.ffn.gz
08/04/2021 10:20:32 AM - humann.search.prescreen - DEBUG: Adding file to database: g__Lachnospiraceae_unclassified.s__Eubacterium_rectale.centroids.v296_v201901b.ffn.gz
08/04/2021 10:20:32 AM - humann.search.prescreen - DEBUG: Adding file to database: g__Marvinbryantia.s__Marvinbryantia_formatexigens.centroids.v296_v201901b.ffn.gz
08/04/2021 10:20:32 AM - humann.search.prescreen - DEBUG: Adding file to database: g__Roseburia.s__Roseburia_intestinalis.centroids.v296_v201901b.ffn.gz
08/04/2021 10:20:32 AM - humann.search.prescreen - INFO: Creating custom ChocoPhlAn database …
08/04/2021 10:20:32 AM - humann.utilities - DEBUG: Using software: /bin/gunzip
08/04/2021 10:20:32 AM - humann.utilities - INFO: Execute command: /bin/gunzip -c /mnt/std-pool/homedirs/egrant/packages/db/chocophlan/g__Akkermansia.s__Akkermansia_muciniphila.centroids.v296_v201901b.ffn.gz /mnt/std-pool/homedirs/egrant/packages/db/chocophlan/g__Bacteroides.s__Bacteroides_caccae.centroids.v296_v201901b.ffn.gz /mnt/std-pool/homedirs/egrant/packages/db/chocophlan/g__Bacteroides.s__Bacteroides_ovatus.centroids.v296_v201901b.ffn.gz /mnt/std-pool/homedirs/egrant/packages/db/chocophlan/g__Bacteroides.s__Bacteroides_thetaiotaomicron.centroids.v296_v201901b.ffn.gz /mnt/std-pool/homedirs/egrant/packages/db/chocophlan/g__Bacteroides.s__Bacteroides_uniformis.centroids.v296_v201901b.ffn.gz /mnt/std-pool/homedirs/egrant/packages/db/chocophlan/g__Barnesiella.s__Barnesiella_intestinihominis.centroids.v296_v201901b.ffn.gz /mnt/std-pool/homedirs/egrant/packages/db/chocophlan/g__Collinsella.s__Collinsella_aerofaciens.centroids.v201901b.ffn.gz /mnt/std-pool/homedirs/egrant/packages/db/chocophlan/g__Desulfovibrio.s__Desulfovibrio_piger.centroids.v296_v201901b.ffn.gz /mnt/std-pool/homedirs/egrant/packages/db/chocophlan/g__Escherichia.s__Escherichia_coli.centroids.v296_v201901b.ffn.gz /mnt/std-pool/homedirs/egrant/packages/db/chocophlan/g__Faecalibacterium.s__Faecalibacterium_prausnitzii.centroids.v296_v201901b.ffn.gz /mnt/std-pool/homedirs/egrant/packages/db/chocophlan/g__Lachnoclostridium.s__Clostridium_symbiosum.centroids.v296_v201901b.ffn.gz /mnt/std-pool/homedirs/egrant/packages/db/chocophlan/g__Lachnospiraceae_unclassified.s__Eubacterium_rectale.centroids.v296_v201901b.ffn.gz /mnt/std-pool/homedirs/egrant/packages/db/chocophlan/g__Marvinbryantia.s__Marvinbryantia_formatexigens.centroids.v296_v201901b.ffn.gz /mnt/std-pool/homedirs/egrant/packages/db/chocophlan/g__Roseburia.s__Roseburia_intestinalis.centroids.v296_v201901b.ffn.gz
08/04/2021 10:20:34 AM - humann.humann - INFO: TIMESTAMP: Completed custom database creation : 3 seconds
08/04/2021 10:20:34 AM - humann.search.nucleotide - INFO: Running bowtie2-build …
08/04/2021 10:20:34 AM - humann.utilities - DEBUG: Using software: /mnt/pcpnfs/homedirs/egrant/anaconda3/envs/humann/bin/bowtie2-build
08/04/2021 10:20:34 AM - humann.utilities - INFO: Execute command: /mnt/pcpnfs/homedirs/egrant/anaconda3/envs/humann/bin/bowtie2-build -f /mnt/std-pool/homedirs/egrant/Mareike/RNASeq/Interlacer/results-batch/10-322-14SM_humann_temp_w_quja3t/10-322-14SM_custom_chocophlan_database.ffn /mnt/std-pool/homedirs/egrant/Mareike/RNASeq/Interlacer/results-batch/10-322-14SM_humann_temp_w_quja3t/10-322-14SM_bowtie2_index
08/04/2021 10:27:10 AM - humann.humann - INFO: TIMESTAMP: Completed database index : 396 seconds
08/04/2021 10:27:11 AM - humann.search.nucleotide - DEBUG: Nucleotide input file is of type: fastq
08/04/2021 10:27:11 AM - humann.utilities - DEBUG: Using software: /mnt/pcpnfs/homedirs/egrant/anaconda3/envs/humann/bin/bowtie2
08/04/2021 10:27:11 AM - humann.utilities - INFO: Execute command: /mnt/pcpnfs/homedirs/egrant/anaconda3/envs/humann/bin/bowtie2 -q -x /mnt/std-pool/homedirs/egrant/Mareike/RNASeq/Interlacer/results-batch/10-322-14SM_humann_temp_w_quja3t/10-322-14SM_bowtie2_index -U /mnt/std-pool/homedirs/egrant/Mareike/RNASeq/Interlacer/results-batch/10-322-14SM_humann_temp_w_quja3t/tmp53tbqzm6/tmpt92d3fca -S /mnt/std-pool/homedirs/egrant/Mareike/RNASeq/Interlacer/results-batch/10-322-14SM_humann_temp_w_quja3t/10-322-14SM_bowtie2_aligned.sam -p 20 --very-sensitive
08/04/2021 10:58:52 AM - humann.utilities - DEBUG: b’34109054 reads; of these:\n 34109054 (100.00%) were unpaired; of these:\n 8408875 (24.65%) aligned 0 times\n 17384808 (50.97%) aligned exactly 1 time\n 8315371 (24.38%) aligned >1 times\n75.35% overall alignment rate\n’
08/04/2021 10:58:52 AM - humann.humann - INFO: TIMESTAMP: Completed nucleotide alignment : 1902 seconds
08/04/2021 11:22:15 AM - humann.utilities - DEBUG: Total alignments where percent identity is not a number: 0
08/04/2021 11:22:15 AM - humann.utilities - DEBUG: Total alignments where alignment length is not a number: 0
08/04/2021 11:22:15 AM - humann.utilities - DEBUG: Total alignments where E-value is not a number: 0
08/04/2021 11:22:15 AM - humann.utilities - DEBUG: Total alignments not included based on large e-value: 0
08/04/2021 11:22:15 AM - humann.utilities - DEBUG: Total alignments not included based on small percent identity: 0
08/04/2021 11:22:15 AM - humann.utilities - DEBUG: Total alignments not included based on small query coverage: 0
08/04/2021 11:23:17 AM - humann.search.blastx_coverage - INFO: Total alignments without coverage information: 0
08/04/2021 11:23:17 AM - humann.search.blastx_coverage - INFO: Total proteins in blastx output: 60535
08/04/2021 11:23:17 AM - humann.search.blastx_coverage - INFO: Total proteins without lengths: 0
08/04/2021 11:23:17 AM - humann.search.blastx_coverage - INFO: Proteins with coverage greater than threshold (50.0): 36355

This is a really interesting application. Based on our benchmarking, profiling ~30M reads without translated search takes ~4 CPU-hours. The bulk of that time is spent mapping reads (and should benefit from parallelization), but you're right that some of the general file I/O is not parallelizable and is also non-trivial.

You're running your samples serially in a loop: did any of them finish, or are you still stuck writing unmapped reads for the first one? If it's the latter, then something is definitely fishy. As a side note, I'd recommend running one sample first and inspecting the results to see if they make sense before committing to running all samples (even if it's only 8 total). You might need to do some additional tuning for this application.

As another side note: you mentioned switching from 2 to 8 cores in your message, but I don't see a --threads request in your HUMAnN call. HUMAnN defaults to 1 core (not all available cores) unless you tell it to use more. Granted, this would not explain why you appear to be stuck writing the unaligned reads.
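For reference, a minimal sketch of the original loop with an explicit --threads added. This is shown as a dry run that only prints the assembled commands (remove the `echo` to execute), and the sample names are placeholders standing in for the real `*.fastqsanger` glob:

```shell
# Dry-run sketch: print each per-sample HUMAnN command with --threads set.
# HUMAnN defaults to 1 thread if --threads is omitted.
THREADS=8
for f in sample1.fastqsanger sample2.fastqsanger; do
    cmd="humann -i $f -o results-batch --taxonomic-profile 14SM_abundance.tsv \
--bypass-translated-search --threads $THREADS --remove-temp-output --resume --verbose"
    echo "$cmd"
done
```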

Hi Eric,

Thanks for your feedback. In the loop that I originally ran, I was always stuck on the first sample. I tried running with only one sample, and also specifying --threads, but I am still running into the same issue where writing unaligned.fa takes an eternity. I estimate this file should be ~1.4 GB in the end (around 1/4 of the size of my original reads file, which was 5.8 GB); after 7 hours it is still only 0.5 GB in size, growing by ~1 MB/min.
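To put numbers on it, at this rate the file would need roughly another 15 hours. A quick sketch with the figures above:

```shell
# Back-of-envelope: 1.4 GB expected, 0.5 GB written so far, ~1 MB/min.
target_mb=1400
written_mb=500
rate_mb_per_min=1
remaining_min=$(( (target_mb - written_mb) / rate_mb_per_min ))
echo "~$remaining_min min (~$(( remaining_min / 60 )) h) left at this rate"
```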

Since the aligned.sam file appears to complete, I decided to try providing it as input, since the run should then pick up after the alignment steps (as described in the biobakery/humann GitHub README). Unfortunately, this still triggers the generation of an unaligned file, so I am in the same boat as before.

I think if there were a way to tell bowtie2 that I do not want the unaligned reads output, I would be able to proceed, but I am not sure whether this is possible?

Thanks again for your input / any ideas!

08/06/2021 11:51:26 AM - humann.humann - INFO: Running humann v3.0.0
08/06/2021 11:51:26 AM - humann.humann - INFO: Output files will be written to: /mnt/std-pool/homedirs/egrant/Mareike/RNASeq/Interlacer/results-10-322-14SM/results-10-322-14SM
08/06/2021 11:51:26 AM - humann.humann - INFO: Writing temp files to directory: /mnt/std-pool/homedirs/egrant/Mareike/RNASeq/Interlacer/results-10-322-14SM/results-10-322-14SM/10-322-14SM_bowtie2_aligned_humann_temp
08/06/2021 11:51:26 AM - humann.utilities - INFO: File ( /mnt/std-pool/homedirs/egrant/Mareike/RNASeq/Interlacer/results-10-322-14SM/10-322-14SM_humann_temp_8o2o9n7u/10-322-14SM_bowtie2_aligned.sam ) is of format: sam
08/06/2021 11:51:26 AM - humann.config - INFO:
Run config settings:

DATABASE SETTINGS
nucleotide database folder = /mnt/std-pool/homedirs/egrant/packages/db/chocophlan/
protein database folder = /mnt/std-pool/homedirs/egrant/packages/db/uniref/
pathways database file 1 = /mnt/pcpnfs/homedirs/egrant/anaconda3/envs/humann/lib/python3.7/site-packages/humann/data/pathways/metacyc_reactions_level4ec_only.uniref.bz2
pathways database file 2 = /mnt/pcpnfs/homedirs/egrant/anaconda3/envs/humann/lib/python3.7/site-packages/humann/data/pathways/metacyc_pathways_structured_filtered
utility mapping database folder = /mnt/pcpnfs/homedirs/egrant/anaconda3/envs/humann/lib/python3.7/site-packages/humann/data/misc

RUN MODES
resume = True
verbose = True
bypass prescreen = True
bypass nucleotide index = True
bypass nucleotide search = True
bypass translated search = True
translated search = diamond
threads = 150

SEARCH MODE
search mode = uniref90
nucleotide identity threshold = 0.0
translated identity threshold = 80.0

ALIGNMENT SETTINGS
bowtie2 options = --very-sensitive
diamond options = --top 1 --outfmt 6
evalue threshold = 1.0
prescreen threshold = 0.01
translated subject coverage threshold = 50.0
translated query coverage threshold = 90.0
nucleotide subject coverage threshold = 50.0
nucleotide query coverage threshold = 90.0

PATHWAYS SETTINGS
minpath = on
xipe = off
gap fill = on

INPUT AND OUTPUT FORMATS
input file format = sam
output file format = tsv
output max decimals = 10
remove stratified output = False
remove column description output = False
log level = DEBUG

08/06/2021 11:51:26 AM - humann.store - DEBUG: Initialize Alignments class instance to minimize memory use
08/06/2021 11:51:26 AM - humann.store - DEBUG: Initialize Reads class instance to minimize memory use
08/06/2021 11:51:45 AM - humann.humann - INFO: Load pathways database part 1: /mnt/pcpnfs/homedirs/egrant/anaconda3/envs/humann/lib/python3.7/site-packages/humann/data/pathways/metacyc_reactions_level4ec_only.uniref.bz2
08/06/2021 11:51:45 AM - humann.humann - INFO: Load pathways database part 2: /mnt/pcpnfs/homedirs/egrant/anaconda3/envs/humann/lib/python3.7/site-packages/humann/data/pathways/metacyc_pathways_structured_filtered
08/06/2021 11:51:45 AM - humann.humann - INFO: Process the sam mapping results …
08/06/2021 12:15:20 PM - humann.utilities - DEBUG: Total alignments where percent identity is not a number: 0
08/06/2021 12:15:20 PM - humann.utilities - DEBUG: Total alignments where alignment length is not a number: 0
08/06/2021 12:15:20 PM - humann.utilities - DEBUG: Total alignments where E-value is not a number: 0
08/06/2021 12:15:20 PM - humann.utilities - DEBUG: Total alignments not included based on large e-value: 0
08/06/2021 12:15:20 PM - humann.utilities - DEBUG: Total alignments not included based on small percent identity: 0
08/06/2021 12:15:20 PM - humann.utilities - DEBUG: Total alignments not included based on small query coverage: 0
08/06/2021 12:16:22 PM - humann.search.blastx_coverage - INFO: Total alignments without coverage information: 0
08/06/2021 12:16:22 PM - humann.search.blastx_coverage - INFO: Total proteins in blastx output: 60535
08/06/2021 12:16:22 PM - humann.search.blastx_coverage - INFO: Total proteins without lengths: 0
08/06/2021 12:16:22 PM - humann.search.blastx_coverage - INFO: Proteins with coverage greater than threshold (50.0): 36355

It’s not possible to disable writing the unaligned reads file, but I’ve never known it to be a bottleneck before (it should finish in a couple of minutes). Is it possible that you’re bumping up against a storage quota for that directory? Granted, I would’ve expected that to halt the writing entirely - not allow it to proceed slowly. I’m running this question by our Senior Developer (Lauren) as well to see if she has other ideas.

EDIT: Lauren noticed that in the second example you’re requesting 150 threads, which is quite large for a single HUMAnN job. There are a number of reasons why that could be counter-productive and causing weird slowdowns (e.g. your machine can’t actually support that compute, or that level of context switching during file writing could be problematic). Can you try running a single sample with a smaller number of threads, e.g. 8? If that doesn’t solve the problem, can you give us some background on the computing environment where you’re running these jobs (number of processors, RAM, load from other users, etc.)?
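In the meantime, a few quick checks on the node might help rule out environment issues. A generic sketch (Linux assumed; `quota` may not be configured on every system):

```shell
# Quick environment checks before re-running.
df -h . || true                 # free space on the output filesystem
quota -s 2>/dev/null || true    # per-user quota, if enabled
nproc                           # cores actually visible to this job
```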

Hi Eric,

Thanks for the note on threads (oops!). I tried running with 8-16 threads and had the same issue. I initially ran this on our institute's system, but then switched to the university cluster (HPC @ Uni.lu) on a "regular" node with 28 CPUs and 128 GB RAM.
I had the same problem in both places, and the university has a larger storage quota, which should rule out that explanation. I also had no more success running over the weekend, when there is typically less activity (24 users, average load <1 at the time of writing).

I will check in with the HPC admin just to be sure I am not overlooking some storage/memory restriction… but please also let me know in case you have other ideas. Thanks again for your help!

The fact that you’re seeing the same unusual behavior on two different systems makes me wonder if there is something peculiar with your input data (that is throwing off the parsing of unmapped reads). Can you try running the same command on one of your systems using a synthetic metagenome from the bioBakery 3 paper? They are available for download here (~8 GB):

http://cmprod1.cibio.unitn.it/biobakery3/synphlan_nonhuman_feb2020.tar

Hi Eric,

I downloaded the synthetic metagenome and ran the first sample (largepan_01.fq.gz) without issue. Then to be sure that there is not an issue with the command I normally use, I ran it again using the largepan_01_metaphlan_bugs_list.tsv from the first run and bypassing translated search. This also ran correctly, producing the 3 expected tsv output files.

In the end, I concluded that I don't have a space issue or a command-syntax issue; there must be something odd about my input files that is tripping things up.

Since I have really only worked with 16S data in the past, I felt more comfortable doing read cleanup in Galaxy, following this tutorial: Metatranscriptomics analysis using microbiome RNA-seq data

I had tried running humann with the interlaced reads as input (as I would in the tutorial) and with concatenated files following the SortMeRNA step, and both had the same issue. BUT finally I tried just using the raw fastq files (adapters removed during post-processing by the sequencing lab, and we use ribo-depletion against mouse/bacterial rRNA, so the data should be at least half non-rRNA bacterial reads), and IT WORKED! Right now I am going back and removing mouse and residual rRNA contamination (using kneaddata), but I am happy to say that the issue is most likely that something weird happened to my files when I used Galaxy for the QC/cleanup.

Anyway, thanks very much for your ideas/support! I'm very excited to be able to implement this workflow on our data :smiley:

Awesome! Glad that the synthetic data experiment highlighted the real issue. I’m super curious to know what was up with the Galaxy-QC’ed reads that was causing the slowdown, but no worries if it’s not obvious.

Dear Franzosa and Erica,

I am currently using kneaddata, metaphlan4, and humann3.7 to analyze my mgx and mtx data. First, kneaddata removed around 1 GB of host and contaminant sequences from each sample. For each metagenomic sample, the forward and reverse files each contain 12 GB of clean data.

Then I concatenated the forward and reverse files with the "cat" command, giving 24 GB per sample. After that, metaphlan 4 ran normally and I got the abundance.txt.
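For reference, the concatenation step looks like this (tiny placeholder files with illustrative names just to demonstrate; HUMAnN treats all input reads as unpaired, so plain concatenation of R1 and R2 works):

```shell
# Write two minimal single-record FASTQ files as stand-ins for real R1/R2.
printf '@r1/1\nACGT\n+\nIIII\n' > sample_R1.fastq
printf '@r1/2\nTGCA\n+\nIIII\n' > sample_R2.fastq
# Concatenate forward and reverse reads into one input file for HUMAnN.
cat sample_R1.fastq sample_R2.fastq > sample_cat.fastq
wc -l < sample_cat.fastq   # 8 lines = 2 four-line FASTQ records
```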

However, I have run into a problem similar to Erica's. Humann3.7 has been running for more than 4 days and shows no sign of finishing.

I tried the one sample you suggested above (largepan_01.fq.gz), around 1 GB. Humann3 finished it in under 8 hours with 8 threads, and I got the 3 expected files. This suggests that neither my command syntax nor the humann3.7 installation itself is at fault.

Another issue is that the fraction of reads left unaligned after nucleotide alignment is very high, reaching ~93.9%.

My samples are pig gut (fecal) metagenomes, with 24 GB of clean data per mgx sample.
The log file shows the following:
RUN MODES
resume = False
verbose = False
bypass prescreen = False
bypass nucleotide index = False
bypass nucleotide search = False
bypass translated search = False
translated search = diamond
threads = 8

SEARCH MODE
search mode = uniref50
nucleotide identity threshold = 0.0
translated identity threshold = 50.0

ALIGNMENT SETTINGS
bowtie2 options = --very-sensitive
diamond options = --top 1 --sensitive --outfmt 6
evalue threshold = 1.0
prescreen threshold = 0.01
translated subject coverage threshold = 50.0
translated query coverage threshold = 90.0
nucleotide subject coverage threshold = 50.0
nucleotide query coverage threshold = 90.0

PATHWAYS SETTINGS
minpath = on
xipe = off
gap fill = on

68734826 (100.00%) were unpaired; of these:\n 64323084 (93.58%) aligned 0 times\n 2896235 (4.21%) aligned exactly 1 time\n 1515507 (2.20%) aligned >1 times\n6.42% overall alignment rate\n’

06/05/2023 04:15:39 PM - humann.humann - INFO: Total gene families from nucleotide alignment: 85969
06/05/2023 04:15:39 PM - humann.humann - INFO: Unaligned reads after nucleotide alignment: 93.9035402519 %

06/05/2023 08:52:30 PM - humann.search.translated - INFO: Aligning to reference database: uniref90_201901b_full.dmnd
06/05/2023 08:52:30 PM - humann.utilities - DEBUG: Remove file: /lustre/BIF/nobackup/liu281/wp2_data/mgx_rawdata/hum_result/concatenated_UT0_humann_temp/tmp11wwnq2_/diamond_m8_hog6j1zw
06/05/2023 08:52:30 PM - humann.utilities - DEBUG: Using software: /lustre/BIF/nobackup/liu281/miniconda3/envs/hum/bin/diamond
06/05/2023 08:52:30 PM - humann.utilities - INFO: Execute command: /lustre/BIF/nobackup/liu281/miniconda3/envs/hum/bin/diamond blastx -

Now, it is still running.

Could you please give me some suggestions to improve the humann3 performance when running my data?

Thanks,

Jun

It looks like we couldn’t identify many species from the sample, hence >90% of reads are being forwarded to UniRef50 translated search, which is the slowest part of the pipeline unfortunately. You also have quite deeply sequenced samples (I’m guessing ~100M reads based on the file size?).
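One way to sanity-check that guess (an assumption-laden sketch: ~150 bp reads, so an uncompressed 4-line FASTQ record is on the order of 240 bytes):

```shell
# Rough read-count estimate from uncompressed FASTQ size.
size_bytes=$((24 * 1024 * 1024 * 1024))   # ~24 GB per sample
bytes_per_record=240                      # assumed ~150 bp reads
reads=$(( size_bytes / bytes_per_record ))
echo "~$(( reads / 1000000 ))M reads"
```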

One option would be to use the EC-filtered UniRef50 database for translated search. This database is about 10% of the size of the full UniRef50 database, because it only includes proteins with an enzyme annotation. This will limit the number of reads you can explain, but it should be a lot faster, and the reads you CAN explain will have good explanations (i.e. known enzymes rather than uncharacterized UniRef50 families).
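If you go that route, the download and run might look like the sketch below. This is shown as a dry run (remove the `echo`s to execute), and the database name and install path are assumptions; please confirm the exact name with `humann_databases --available`:

```shell
# Dry-run sketch: fetch the EC-filtered UniRef50 DIAMOND database, then
# point HUMAnN at it with --protein-database (folder path is illustrative).
DB_DIR=/path/to/databases
dl="humann_databases --download uniref uniref50_ec_filtered_diamond $DB_DIR"
run="humann -i sample.fastq -o results --protein-database $DB_DIR/uniref"
echo "$dl"
echo "$run"
```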

Hi Eric, thanks for your suggestions. The clean sample size is ~ 60M reads.

It is a little odd that >90% of reads are unaligned in Humann3, although the taxonomic profile generated by Metaphlan4 is generally good: Bacteria account for 96% and Archaea for ~4% of relative abundance. Unknown SGBs account for around 40%, and most are kSGBs.
I would like to check some of those unaligned reads to see what is there. I will try the EC-filtered UniRef50 database as you suggested, then post the result here.

Thanks a lot.

Jun

Hi Franzosa,

I got the humann3.7 result on my mgx data with the EC-filtered UniRef50 database. It was much faster (~19 hours with 8 threads for 24 GB of data) than using the full UniRef50 database (which had already been running for 5 days). The alignment rate after the translated search increased from ~6% to ~30%, and the unmapped rate is ~30%. Both are acceptable, I think.

Thanks for your helpful suggestions.

Jun

Great, thanks for following up and glad to hear this helped!