Hello,
I would like to use humann3 on RNAseq data from a defined microbial community, and I am a bit concerned about how long it is taking to run. Am I doing something wrong, or is there something that can be improved? It seems to be bottlenecked at the nucleotide alignment post-processing step, where the reads that bowtie2 did not align are written to a .fa file to be used in the translated search step. After >12 hours, there is still no end in sight (though the unaligned.fa file is continuously updated over this period).
My input command is below (I have 8 files, each containing the concatenated F and R reads; each file is ~3-5 GB):
for f in *.fastqsanger; do humann -i "$f" -o results-batch --taxonomic-profile 14SM_abundance.tsv --bypass-translated-search --remove-temp-output --resume --verbose; done > out.log 2> out.err
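For anyone who wants to reproduce this, here is a rough sketch of the same loop as a bash function. It differs from my actual command in two ways, both of which are assumptions on my part: it passes `--threads` explicitly (the config dump in the log reports `threads = 20`, and 8 is simply the core count I requested for the rerun), and it writes one pair of log files per sample instead of a single out.log. The function name, the `dry_run` switch and the per-sample log file names are made up for illustration.

```shell
#!/usr/bin/env bash
# Run HUMAnN on every *.fastqsanger file in the current directory.
# $1 = thread count (assumed default: 8), $2 = 1 to only print the commands.
run_humann_batch() {
    local threads=${1:-8} dry_run=${2:-0} f sample
    for f in *.fastqsanger; do
        [ -e "$f" ] || continue          # nothing matched the glob
        sample=${f%.fastqsanger}
        local cmd=(humann -i "$f" -o results-batch
                   --taxonomic-profile 14SM_abundance.tsv
                   --bypass-translated-search --remove-temp-output
                   --resume --verbose --threads "$threads")
        if [ "$dry_run" = 1 ]; then
            printf '%s\n' "${cmd[*]}"    # show the command without running it
        else
            # Hypothetical per-sample log names, so runs do not share one file.
            "${cmd[@]}" > "$sample.out.log" 2> "$sample.out.err"
        fi
    done
}
```

Calling `run_humann_batch 8 1` prints the commands that would be run, which is a cheap way to check the flags before committing to a long job.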
I skipped metaphlan and used a taxonomic profile based on 16S seq results. I used the average profile for all samples, since it is only used to build the custom chocophlan database; from what I can tell, the actual abundances are not used by humann. During nucleotide alignment, 75% of reads are mapped overall (50% uniquely, 25% >1 time). To my knowledge this is reasonable, so it should be fine to proceed. Since the 14 bacteria are well-characterised, I opted to use the --bypass-translated-search option to reduce the run time, so the unaligned file is not even that important, except to confirm that I am not losing reads that should map to the 14 bacteria. I did confirm that the reads in this file do not have hits in blastn.
Is there any way to speed this process up? The log below is from when I ran using 2 cores yesterday. I have since increased my requested number of cores to 8, but so far there's not much improvement. If I understand correctly, the step that is taking so long (writing unaligned reads to the file) is not parallelized, so the extra cores should not have much of an effect on it.
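The mapping rates quoted above come straight from the bowtie2 summary in the log. As a sanity check, they can be recomputed from the raw read counts with a small helper like the one below; the function name is made up, and the four arguments are the counts bowtie2 reports (total, aligned 0 times, aligned exactly 1 time, aligned >1 times).

```shell
#!/usr/bin/env bash
# Recompute bowtie2's summary percentages from its raw read counts.
# Arguments: total reads, reads aligned 0 times, exactly 1 time, >1 times.
alignment_rates() {
    awk -v total="$1" -v none="$2" -v unique="$3" -v multi="$4" 'BEGIN {
        if (total <= 0) { print "total must be positive" > "/dev/stderr"; exit 1 }
        printf "unaligned\t%.2f%%\n", 100 * none / total
        printf "unique\t%.2f%%\n",    100 * unique / total
        printf "multi\t%.2f%%\n",     100 * multi / total
        # Overall rate = every read that aligned at least once.
        printf "overall\t%.2f%%\n",   100 * (unique + multi) / total
    }'
}
```

With the counts from my run, `alignment_rates 34109054 8408875 17384808 8315371` reproduces the 24.65%, 50.97%, 24.38% and 75.35% figures that bowtie2 printed.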
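Since the unaligned file keeps growing, one way to judge how far along the write is would be to count the FASTA records written so far and compare that to the number of reads expected. The sketch below does this; the function names are made up, and the expected total is an assumption. I would use the 8408875 reads that bowtie2 reports as "aligned 0 times", but as far as I understand humann may also write reads whose alignments fail its filters, so treat that number as a lower bound.

```shell
#!/usr/bin/env bash
# Count FASTA records (header lines starting with ">") in a file.
count_fasta_records() {
    # grep -c exits non-zero when the count is 0, so ignore its status.
    grep -c '^>' "$1" || true
}

# Report how many records have been written relative to an expected total.
# $1 = path to the FASTA file, $2 = expected number of reads (an estimate).
unaligned_progress() {
    local fasta=$1 expected=$2 written
    written=$(count_fasta_records "$fasta")
    awk -v w="$written" -v e="$expected" 'BEGIN {
        printf "%d of %d reads written (%.1f%%)\n", w, e, 100 * w / e
    }'
}
```

For example, `unaligned_progress path/to/unaligned.fa 8408875`, run a few minutes apart, gives a rough write rate and therefore a rough time to completion.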
Please let me know if I can provide any more details. Thanks in advance for your input!
08/04/2021 10:19:17 AM - humann.humann - INFO: Running humann v3.0.0
08/04/2021 10:19:17 AM - humann.humann - INFO: Output files will be written to: /mnt/std-pool/homedirs/egrant/Mareike/RNASeq/Interlacer/results-batch
08/04/2021 10:19:17 AM - humann.humann - INFO: Writing temp files to directory: /mnt/std-pool/homedirs/egrant/Mareike/RNASeq/Interlacer/results-batch/10-322-14SM_humann_temp_w_quja3t
08/04/2021 10:19:17 AM - humann.utilities - INFO: File ( /mnt/std-pool/homedirs/egrant/Mareike/RNASeq/Interlacer/10-322-14SM.fastqsanger ) is of format: fastq
08/04/2021 10:19:17 AM - humann.humann - INFO: Removing spaces from identifiers in input file
08/04/2021 10:20:12 AM - humann.utilities - DEBUG: Check software, metaphlan, for required version, 3.0
08/04/2021 10:20:13 AM - humann.utilities - INFO: Using metaphlan version 3.0
08/04/2021 10:20:13 AM - humann.utilities - DEBUG: Check software, bowtie2, for required version, 2.2
08/04/2021 10:20:13 AM - humann.utilities - INFO: Using bowtie2 version 2.4
08/04/2021 10:20:13 AM - humann.config - INFO:
Run config settings:
DATABASE SETTINGS
nucleotide database folder = /mnt/std-pool/homedirs/egrant/packages/db/chocophlan/
protein database folder = /mnt/std-pool/homedirs/egrant/packages/db/uniref/
pathways database file 1 = /mnt/pcpnfs/homedirs/egrant/anaconda3/envs/humann/lib/python3.7/site-packages/humann/data/pathways/metacyc_reactions_level4ec_only.uniref.bz2
pathways database file 2 = /mnt/pcpnfs/homedirs/egrant/anaconda3/envs/humann/lib/python3.7/site-packages/humann/data/pathways/metacyc_pathways_structured_filtered
utility mapping database folder = /mnt/pcpnfs/homedirs/egrant/anaconda3/envs/humann/lib/python3.7/site-packages/humann/data/misc
RUN MODES
resume = True
verbose = True
bypass prescreen = False
bypass nucleotide index = False
bypass nucleotide search = False
bypass translated search = True
translated search = diamond
threads = 20
SEARCH MODE
search mode = uniref90
nucleotide identity threshold = 0.0
translated identity threshold = 80.0
ALIGNMENT SETTINGS
bowtie2 options = --very-sensitive
diamond options = --top 1 --outfmt 6
evalue threshold = 1.0
prescreen threshold = 0.01
translated subject coverage threshold = 50.0
translated query coverage threshold = 90.0
nucleotide subject coverage threshold = 50.0
nucleotide query coverage threshold = 90.0
PATHWAYS SETTINGS
minpath = on
xipe = off
gap fill = on
INPUT AND OUTPUT FORMATS
input file format = fastq
output file format = tsv
output max decimals = 10
remove stratified output = False
remove column description output = False
log level = DEBUG
08/04/2021 10:20:13 AM - humann.store - DEBUG: Initialize Alignments class instance to minimize memory use
08/04/2021 10:20:13 AM - humann.store - DEBUG: Initialize Reads class instance to minimize memory use
08/04/2021 10:20:32 AM - humann.humann - INFO: Load pathways database part 1: /mnt/pcpnfs/homedirs/egrant/anaconda3/envs/humann/lib/python3.7/site-packages/humann/data/pathways/metacyc_reactions_level4ec_only.uniref.bz2
08/04/2021 10:20:32 AM - humann.humann - INFO: Load pathways database part 2: /mnt/pcpnfs/homedirs/egrant/anaconda3/envs/humann/lib/python3.7/site-packages/humann/data/pathways/metacyc_pathways_structured_filtered
08/04/2021 10:20:32 AM - humann.search.prescreen - INFO: Found g__Bacteroides.s__Bacteroides_ovatus : 20.00% of mapped reads
08/04/2021 10:20:32 AM - humann.search.prescreen - INFO: Found g__Bacteroides.s__Bacteroides_uniformis : 4.00% of mapped reads
08/04/2021 10:20:32 AM - humann.search.prescreen - INFO: Found g__Bacteroides.s__Bacteroides_thetaiotaomicron : 17.00% of mapped reads
08/04/2021 10:20:32 AM - humann.search.prescreen - INFO: Found g__Akkermansia.s__Akkermansia_muciniphila : 15.00% of mapped reads
08/04/2021 10:20:32 AM - humann.search.prescreen - INFO: Found g__Roseburia.s__Roseburia_intestinalis : 9.00% of mapped reads
08/04/2021 10:20:32 AM - humann.search.prescreen - INFO: Found g__Marvinbryantia.s__Marvinbryantia_formatexigens : 1.00% of mapped reads
08/04/2021 10:20:32 AM - humann.search.prescreen - INFO: Found g__Collinsella.s__Collinsella_aerofaciens : 1.00% of mapped reads
08/04/2021 10:20:32 AM - humann.search.prescreen - INFO: Found g__Bacteroides.s__Bacteroides_caccae : 12.00% of mapped reads
08/04/2021 10:20:32 AM - humann.search.prescreen - INFO: Found g__Barnesiella.s__Barnesiella_intestinihominis : 3.00% of mapped reads
08/04/2021 10:20:32 AM - humann.search.prescreen - INFO: Found g__Desulfovibrio.s__Desulfovibrio_piger : 1.00% of mapped reads
08/04/2021 10:20:32 AM - humann.search.prescreen - INFO: Found g__Lachnoclostridium.s__Clostridium_symbiosum : 1.00% of mapped reads
08/04/2021 10:20:32 AM - humann.search.prescreen - INFO: Found g__Lachnospiraceae_unclassified.s__Eubacterium_rectale : 7.00% of mapped reads
08/04/2021 10:20:32 AM - humann.search.prescreen - INFO: Found g__Escherichia.s__Escherichia_coli : 8.00% of mapped reads
08/04/2021 10:20:32 AM - humann.search.prescreen - INFO: Found g__Faecalibacterium.s__Faecalibacterium_prausnitzii : 1.00% of mapped reads
08/04/2021 10:20:32 AM - humann.search.prescreen - INFO: Total species selected from prescreen: 14
08/04/2021 10:20:32 AM - humann.search.prescreen - DEBUG: Adding file to database: g__Akkermansia.s__Akkermansia_muciniphila.centroids.v296_v201901b.ffn.gz
08/04/2021 10:20:32 AM - humann.search.prescreen - DEBUG: Adding file to database: g__Bacteroides.s__Bacteroides_caccae.centroids.v296_v201901b.ffn.gz
08/04/2021 10:20:32 AM - humann.search.prescreen - DEBUG: Adding file to database: g__Bacteroides.s__Bacteroides_ovatus.centroids.v296_v201901b.ffn.gz
08/04/2021 10:20:32 AM - humann.search.prescreen - DEBUG: Adding file to database: g__Bacteroides.s__Bacteroides_thetaiotaomicron.centroids.v296_v201901b.ffn.gz
08/04/2021 10:20:32 AM - humann.search.prescreen - DEBUG: Adding file to database: g__Bacteroides.s__Bacteroides_uniformis.centroids.v296_v201901b.ffn.gz
08/04/2021 10:20:32 AM - humann.search.prescreen - DEBUG: Adding file to database: g__Barnesiella.s__Barnesiella_intestinihominis.centroids.v296_v201901b.ffn.gz
08/04/2021 10:20:32 AM - humann.search.prescreen - DEBUG: Adding file to database: g__Collinsella.s__Collinsella_aerofaciens.centroids.v201901b.ffn.gz
08/04/2021 10:20:32 AM - humann.search.prescreen - DEBUG: Adding file to database: g__Desulfovibrio.s__Desulfovibrio_piger.centroids.v296_v201901b.ffn.gz
08/04/2021 10:20:32 AM - humann.search.prescreen - DEBUG: Adding file to database: g__Escherichia.s__Escherichia_coli.centroids.v296_v201901b.ffn.gz
08/04/2021 10:20:32 AM - humann.search.prescreen - DEBUG: Adding file to database: g__Faecalibacterium.s__Faecalibacterium_prausnitzii.centroids.v296_v201901b.ffn.gz
08/04/2021 10:20:32 AM - humann.search.prescreen - DEBUG: Adding file to database: g__Lachnoclostridium.s__Clostridium_symbiosum.centroids.v296_v201901b.ffn.gz
08/04/2021 10:20:32 AM - humann.search.prescreen - DEBUG: Adding file to database: g__Lachnospiraceae_unclassified.s__Eubacterium_rectale.centroids.v296_v201901b.ffn.gz
08/04/2021 10:20:32 AM - humann.search.prescreen - DEBUG: Adding file to database: g__Marvinbryantia.s__Marvinbryantia_formatexigens.centroids.v296_v201901b.ffn.gz
08/04/2021 10:20:32 AM - humann.search.prescreen - DEBUG: Adding file to database: g__Roseburia.s__Roseburia_intestinalis.centroids.v296_v201901b.ffn.gz
08/04/2021 10:20:32 AM - humann.search.prescreen - INFO: Creating custom ChocoPhlAn database …
08/04/2021 10:20:32 AM - humann.utilities - DEBUG: Using software: /bin/gunzip
08/04/2021 10:20:32 AM - humann.utilities - INFO: Execute command: /bin/gunzip -c /mnt/std-pool/homedirs/egrant/packages/db/chocophlan/g__Akkermansia.s__Akkermansia_muciniphila.centroids.v296_v201901b.ffn.gz /mnt/std-pool/homedirs/egrant/packages/db/chocophlan/g__Bacteroides.s__Bacteroides_caccae.centroids.v296_v201901b.ffn.gz /mnt/std-pool/homedirs/egrant/packages/db/chocophlan/g__Bacteroides.s__Bacteroides_ovatus.centroids.v296_v201901b.ffn.gz /mnt/std-pool/homedirs/egrant/packages/db/chocophlan/g__Bacteroides.s__Bacteroides_thetaiotaomicron.centroids.v296_v201901b.ffn.gz /mnt/std-pool/homedirs/egrant/packages/db/chocophlan/g__Bacteroides.s__Bacteroides_uniformis.centroids.v296_v201901b.ffn.gz /mnt/std-pool/homedirs/egrant/packages/db/chocophlan/g__Barnesiella.s__Barnesiella_intestinihominis.centroids.v296_v201901b.ffn.gz /mnt/std-pool/homedirs/egrant/packages/db/chocophlan/g__Collinsella.s__Collinsella_aerofaciens.centroids.v201901b.ffn.gz /mnt/std-pool/homedirs/egrant/packages/db/chocophlan/g__Desulfovibrio.s__Desulfovibrio_piger.centroids.v296_v201901b.ffn.gz /mnt/std-pool/homedirs/egrant/packages/db/chocophlan/g__Escherichia.s__Escherichia_coli.centroids.v296_v201901b.ffn.gz /mnt/std-pool/homedirs/egrant/packages/db/chocophlan/g__Faecalibacterium.s__Faecalibacterium_prausnitzii.centroids.v296_v201901b.ffn.gz /mnt/std-pool/homedirs/egrant/packages/db/chocophlan/g__Lachnoclostridium.s__Clostridium_symbiosum.centroids.v296_v201901b.ffn.gz /mnt/std-pool/homedirs/egrant/packages/db/chocophlan/g__Lachnospiraceae_unclassified.s__Eubacterium_rectale.centroids.v296_v201901b.ffn.gz /mnt/std-pool/homedirs/egrant/packages/db/chocophlan/g__Marvinbryantia.s__Marvinbryantia_formatexigens.centroids.v296_v201901b.ffn.gz /mnt/std-pool/homedirs/egrant/packages/db/chocophlan/g__Roseburia.s__Roseburia_intestinalis.centroids.v296_v201901b.ffn.gz
08/04/2021 10:20:34 AM - humann.humann - INFO: TIMESTAMP: Completed custom database creation : 3 seconds
08/04/2021 10:20:34 AM - humann.search.nucleotide - INFO: Running bowtie2-build …
08/04/2021 10:20:34 AM - humann.utilities - DEBUG: Using software: /mnt/pcpnfs/homedirs/egrant/anaconda3/envs/humann/bin/bowtie2-build
08/04/2021 10:20:34 AM - humann.utilities - INFO: Execute command: /mnt/pcpnfs/homedirs/egrant/anaconda3/envs/humann/bin/bowtie2-build -f /mnt/std-pool/homedirs/egrant/Mareike/RNASeq/Interlacer/results-batch/10-322-14SM_humann_temp_w_quja3t/10-322-14SM_custom_chocophlan_database.ffn /mnt/std-pool/homedirs/egrant/Mareike/RNASeq/Interlacer/results-batch/10-322-14SM_humann_temp_w_quja3t/10-322-14SM_bowtie2_index
08/04/2021 10:27:10 AM - humann.humann - INFO: TIMESTAMP: Completed database index : 396 seconds
08/04/2021 10:27:11 AM - humann.search.nucleotide - DEBUG: Nucleotide input file is of type: fastq
08/04/2021 10:27:11 AM - humann.utilities - DEBUG: Using software: /mnt/pcpnfs/homedirs/egrant/anaconda3/envs/humann/bin/bowtie2
08/04/2021 10:27:11 AM - humann.utilities - INFO: Execute command: /mnt/pcpnfs/homedirs/egrant/anaconda3/envs/humann/bin/bowtie2 -q -x /mnt/std-pool/homedirs/egrant/Mareike/RNASeq/Interlacer/results-batch/10-322-14SM_humann_temp_w_quja3t/10-322-14SM_bowtie2_index -U /mnt/std-pool/homedirs/egrant/Mareike/RNASeq/Interlacer/results-batch/10-322-14SM_humann_temp_w_quja3t/tmp53tbqzm6/tmpt92d3fca -S /mnt/std-pool/homedirs/egrant/Mareike/RNASeq/Interlacer/results-batch/10-322-14SM_humann_temp_w_quja3t/10-322-14SM_bowtie2_aligned.sam -p 20 --very-sensitive
08/04/2021 10:58:52 AM - humann.utilities - DEBUG: b'34109054 reads; of these:\n 34109054 (100.00%) were unpaired; of these:\n 8408875 (24.65%) aligned 0 times\n 17384808 (50.97%) aligned exactly 1 time\n 8315371 (24.38%) aligned >1 times\n75.35% overall alignment rate\n'
08/04/2021 10:58:52 AM - humann.humann - INFO: TIMESTAMP: Completed nucleotide alignment : 1902 seconds
08/04/2021 11:22:15 AM - humann.utilities - DEBUG: Total alignments where percent identity is not a number: 0
08/04/2021 11:22:15 AM - humann.utilities - DEBUG: Total alignments where alignment length is not a number: 0
08/04/2021 11:22:15 AM - humann.utilities - DEBUG: Total alignments where E-value is not a number: 0
08/04/2021 11:22:15 AM - humann.utilities - DEBUG: Total alignments not included based on large e-value: 0
08/04/2021 11:22:15 AM - humann.utilities - DEBUG: Total alignments not included based on small percent identity: 0
08/04/2021 11:22:15 AM - humann.utilities - DEBUG: Total alignments not included based on small query coverage: 0
08/04/2021 11:23:17 AM - humann.search.blastx_coverage - INFO: Total alignments without coverage information: 0
08/04/2021 11:23:17 AM - humann.search.blastx_coverage - INFO: Total proteins in blastx output: 60535
08/04/2021 11:23:17 AM - humann.search.blastx_coverage - INFO: Total proteins without lengths: 0
08/04/2021 11:23:17 AM - humann.search.blastx_coverage - INFO: Proteins with coverage greater than threshold (50.0): 36355
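To see where the time goes in a log like the one above, the gaps between consecutive timestamped lines can be listed with a short awk helper. This is only a sketch: the function name is made up, and it assumes that all lines fall on the same calendar day (true for this excerpt, but not for a run that crosses midnight).

```shell
#!/usr/bin/env bash
# Print gaps of at least MIN_GAP seconds between consecutive timestamped lines.
# $1 = log file, $2 = minimum gap in seconds (default 60).
# Assumption: every timestamp is on the same calendar day.
log_gaps() {
    local log_file=$1 min_gap=${2:-60}
    awk -v min_gap="$min_gap" '
        # Only lines starting with "MM/DD/YYYY HH:MM:SS AM|PM" carry a timestamp.
        $1 ~ /^[0-9][0-9]\/[0-9][0-9]\/[0-9][0-9][0-9][0-9]$/ && $3 ~ /^(AM|PM)$/ {
            split($2, t, ":")
            hour = t[1] % 12                 # 12 AM -> 0, 12 PM -> 0 (+12 below)
            if ($3 == "PM") hour += 12
            now = hour * 3600 + t[2] * 60 + t[3]
            if (seen && now - prev >= min_gap)
                printf "%d s before: %s\n", now - prev, $0
            prev = now
            seen = 1
        }
    ' "$log_file"
}
```

Running `log_gaps out.log 60` on this log flags, among others, the 1403 s between the end of the nucleotide alignment (10:58:52) and the next message (11:22:15), which is the post-processing window I am asking about.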