Optimising Humann run time - low species number - uniref database question

wking · February 9, 2022, 4:40pm

Hello. I was hoping to ask a quick question about the Humann run time on conda, as I have been running a single sample for nearly 48 hours. I also apologise for my naivety; I typically work with 16S and ITS amplicon sequencing data, so this is all new for me.

The samples are paired-end metagenomic data from soils, sequenced on the NextSeq 2000 with 150x150bp. For this particular sample, I started off with roughly 4.9 gigabases of data. Before trimming, just about all of the reads were >140bp in length. I trimmed the data using fastp, which dropped it to around 4.6 gigabases of data. I used fastp because I was concerned about the polyG tails generated by NextSeq sequencing and I was uncertain about how Kneaddata/trimmomatic deals with them.

I then ran the trimmed data through Kneaddata using the human_genome database. I bypassed the trim step and concatenated the final output. (Side question: looking at the raw pair1 and raw pair2, they have the same numbers. I assume Kneaddata combines them?) The Kneaddata log file specifies 14044534 for final pair1 and final pair2, which I believe equates to 4.2 gigabases of data (14044534*300 / 1000000000).

I am currently using the concatenated file (X_Kneaded.fastq) as my input file for Humann3. The concatenated file is nearly 11 gigabytes in size. I am using the databases chocophlan and uniref90_diamond. I am running the humann command with all the defaults and 24 threads. This one sample has been running for nearly 48 hours. I don’t think(?) the command is stuck because it is writing data to the tmp folder. Is this typical?
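The gigabase estimate above can be checked with a few lines of Python. This is only a rough sketch: it assumes every surviving read is still the full 150 bp, so after trimming the real figure will be somewhat lower. The function name is made up for illustration.

```python
# Rough estimate of sequencing yield from a read-pair count.
# Assumption: every read is the full 150 bp, so this is an upper bound
# for data that has already been trimmed.

def pairs_to_gigabases(n_pairs: int, read_length: int = 150) -> float:
    """Return total bases, in gigabases, for n_pairs paired-end reads."""
    bases_per_pair = 2 * read_length  # one forward read plus one reverse read
    return n_pairs * bases_per_pair / 1e9

final_pairs = 14_044_534  # final pair count reported in the Kneaddata log
print(f"{pairs_to_gigabases(final_pairs):.2f} Gb")  # prints "4.21 Gb"
```

The result agrees with the ~4.2 gigabases quoted in the post.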

Secondary question: I was uncertain about how to use the argument “--memory-use” and what the minimum value is. I have 768 gigabytes of memory available. Should I use this argument and set it to something like --memory-use 700 or 750?

Third question: I know that Humann uses Metaphlan for taxonomic assignment. Am I able to get the taxonomic data after Humann finishes running or do I need to run Metaphlan as well?

Thank you so much for your help!

Regards,
William

wking · February 10, 2022, 3:30pm

Hello. I just wanted to provide an update. The run finished after about 2.5 days. However, I am now seeing a high percentage of unaligned reads after translation (~70%) using the uniref90 database.

Using the “_diamond_unaligned.fa” file, I ran some blastx on the first 5 sequences and it looks like they’re matching to microbial proteins but the percentage identity varies from 58% to 85%, so I assume I am just missing things with the uniref90. I will re-run the data with uniref50. If I only use uniref50, will I miss proteins with good matches (i.e. those hits I currently have against the uniref90 database)? Is it recommended to have both Uniref90 and Uniref50 in the same folder? Alternatively, would you recommend using the uniref90 unaligned output file “_diamond_unaligned.fa” with another humann run against the uniref50 database?

What is also strange to me is that the “_metaphlan_bugs_list” has very few microorganisms. I count roughly 15 species in this file. The starting environment was soil, so I would expect to see far more microbes. The split ended up being 97% bacteria and 3% fungi. The 97% bacteria was represented by 13 species. Is this typical?
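For anyone wanting to repeat the species count, here is a minimal Python sketch. It assumes the bugs list is a MetaPhlAn-style tab-separated profile in which the first column is a pipe-delimited lineage and species-level rows end in a rank starting with "s__". The example rows and clade names are invented toy data, not taken from this sample.

```python
# Count species-level rows in a MetaPhlAn-style profile.
# Assumptions: header/comment lines start with "#", the first tab-separated
# column is the lineage, and species rows end in a rank starting with "s__".

def count_species(lines):
    """Return the number of distinct species-level clades in the profile."""
    species = set()
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and header/comment lines
        clade = line.split("\t")[0]
        last_rank = clade.split("|")[-1]
        if last_rank.startswith("s__"):
            species.add(last_rank)
    return len(species)

# Invented toy profile (hypothetical clade names and abundances).
example_profile = [
    "#clade_name\tNCBI_tax_id\trelative_abundance\tadditional_species",
    "k__Bacteria\t2\t97.0\t",
    "k__Bacteria|p__Example_phylum\t0\t97.0\t",
    "k__Bacteria|p__Example_phylum|s__Example_species_a\t0\t60.0\t",
    "k__Bacteria|p__Example_phylum|s__Example_species_b\t0\t37.0\t",
    "k__Eukaryota\t2759\t3.0\t",
]

print(count_species(example_profile))  # prints 2
```

To run it on a real file, pass an open file handle instead of the list, for example `count_species(open(path))`, where `path` points at the bugs list.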

Another side question: From my Kneaddata output, I ran some blastn on the “bowtie2_paired_contam” files and some of the hits are definitely hitting microorganisms. For example, one hit was against Aspergillus flavus strain A9 chromosome 7 (100% query, 2e-69, identity 99.34%). Is this also typical? Maybe it’s a human-associated fungal species?

lauren.j.mciver · February 11, 2022, 10:28pm

Hello William, With an input file of ~11 GB, and depending on the number of reads aligned in the nucleotide step, I don’t think 48 hours is unexpected. I would give it ~12 hours more, and if it is not complete by then you might double check that it is not stuck.

The memory usage option selections are: [“minimum”, “maximum”]. “minimum” is the default; “maximum” uses a lot more memory and less disk space. We always use the “minimum” mode, especially for large input files.
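As a hedged illustration of the point above, the sketch below assembles a humann command line in Python without running it. It assumes the --input, --output, --threads and --memory-use options as in HUMAnN 3. The output folder name humann_out and the helper function are placeholders, not anything from this thread. The value passed to --memory-use is one of the two words, not a number of gigabytes.

```python
import shlex

# Build (but do not run) a humann command line.
# Assumptions: option names as in HUMAnN 3; "humann_out" is a placeholder.

ALLOWED_MEMORY_MODES = ("minimum", "maximum")

def build_humann_command(input_fastq, output_dir, threads=24, memory_use="minimum"):
    """Return the humann command as a list of arguments."""
    if memory_use not in ALLOWED_MEMORY_MODES:
        # A value such as "700" is rejected: the option takes a mode, not a size.
        raise ValueError(f"memory_use must be one of {ALLOWED_MEMORY_MODES}")
    return [
        "humann",
        "--input", input_fastq,
        "--output", output_dir,
        "--threads", str(threads),
        "--memory-use", memory_use,
    ]

cmd = build_humann_command("X_Kneaded.fastq", "humann_out")
print(shlex.join(cmd))
```

The printed line can be pasted into a shell, or the list can be handed to `subprocess.run(cmd, check=True)` once the paths are correct.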

Yes, the taxonomic profile is included in the output folder.

Thanks!
Lauren
