Optimising Humann run time - low species number - uniref database question

Hello. I was hoping to ask a quick question about the Humann run time on conda, as I have been running a single sample for nearly 48 hours. I also apologise for my naivety; I typically work with 16S and ITS amplicon sequencing data, so this is all new for me.

The samples are paired-end metagenomic data from soils, sequenced on the NextSeq 2000 with 2x150 bp reads. For this particular sample, I started off with roughly 4.9 gigabases of data. Before trimming, just about all of the reads were >140 bp in length. I trimmed the data using fastp, which dropped it to around 4.6 gigabases. I used fastp because I was concerned about the polyG tails generated by NextSeq sequencing and I was uncertain how Kneaddata/Trimmomatic deals with them. I then ran the trimmed data through Kneaddata with the human_genome database. I bypassed the trim step and concatenated the final output. (Side question: the raw pair1 and raw pair2 counts are the same. I assume Kneaddata combines them?) The Kneaddata log file specifies 14044534 reads for both final pair1 and final pair2, which I believe equates to 4.2 gigabases of data (14044534*300 / 1000000000).

I am currently using the concatenated file (X_Kneaded.fastq) as my input file for Humann3. The concatenated file is nearly 11 gigabytes in size. I am using the databases chocophlan and uniref90_diamond, and I am running the humann command with all the defaults and 24 threads. This one sample has been running for nearly 48 hours. I don’t think(?) the command is stuck, because it is still writing data to the tmp folder. Is this typical?
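For anyone checking the same arithmetic, here is a minimal Python sketch of the gigabase estimate above. It assumes every read pair contributes the full 300 bases (2x150 bp), so after trimming it is an upper bound rather than an exact figure; the function name `gigabases` is just an illustrative helper, not part of Kneaddata or Humann.

```python
def gigabases(read_pairs: int, bases_per_pair: int = 300) -> float:
    """Approximate total sequence in gigabases for paired-end data.

    Assumes each pair contributes `bases_per_pair` bases (2 x 150 bp here),
    which overestimates slightly once reads have been trimmed.
    """
    return read_pairs * bases_per_pair / 1_000_000_000


# Final pair count taken from the Kneaddata log quoted in the post.
final_pairs = 14044534
print(f"{gigabases(final_pairs):.2f} Gb")
```

With the pair count from the log this comes out at roughly 4.2 Gb, which matches the estimate in the post.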

Secondary question: I was uncertain about how to use the argument “--memory-use” and what the minimum value is. I have 768 gigabytes of memory available. Should I use the memory-use argument and set it to something like --memory-use 700 or 750?

Third question: I know that Humann uses Metaphlan for taxonomic assignment. Am I able to get the taxonomic data after Humann finishes running or do I need to run Metaphlan as well?

Thank you so much for your help!


Hello. I just wanted to provide an update. The run finished after about 2.5 days. However, I am now seeing a high percentage of unaligned reads after translated search (~70%) using the uniref90 database.

Using the “_diamond_unaligned.fa” file, I ran blastx on the first 5 sequences, and they appear to match microbial proteins, but the percentage identity varies from 58% to 85%, so I assume I am simply missing these with uniref90. I will re-run the data with uniref50. If I only use uniref50, will I miss proteins with good matches (i.e. the hits I currently have against the uniref90 database)? Is it recommended to have both uniref90 and uniref50 in the same folder? Alternatively, would you recommend using the uniref90 unaligned output file “_diamond_unaligned.fa” as input to another humann run against the uniref50 database?

What is also strange to me is that the “_metaphlan_bugs_list” file has very few microorganisms. I count roughly 15 species in it. The starting environment was soil, so I would expect to see far more microbes. The split ended up being 97% bacteria and 3% fungi, with the 97% bacteria represented by 13 species. Is this typical?
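In case it helps others reproduce a count like this, below is a rough Python sketch for tallying species-level rows per kingdom in a bugs list. It assumes the usual MetaPhlAn-style layout: tab-separated rows, comment lines starting with `#`, and a first column holding a `|`-separated lineage in which species rows end in an `s__` entry. The example rows and all names in them (`Species_A`, `P1`, etc.) are placeholders, not real output.

```python
from collections import Counter


def species_by_kingdom(lines):
    """Count species-level rows per kingdom in a MetaPhlAn-style profile.

    Assumes the first tab-separated column is a lineage such as
    k__Bacteria|p__...|s__Some_species, and that species rows are the
    ones whose last rank starts with "s__".
    """
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and header/comment lines
        ranks = line.split("\t")[0].split("|")
        if ranks[-1].startswith("s__"):
            counts[ranks[0]] += 1
    return counts


# Placeholder rows for illustration only (not real results).
example = [
    "#clade_name\trelative_abundance",
    "k__Bacteria\t97.0",
    "k__Bacteria|p__P1|c__C1|o__O1|f__F1|g__G1\t97.0",
    "k__Bacteria|p__P1|c__C1|o__O1|f__F1|g__G1|s__Species_A\t60.0",
    "k__Bacteria|p__P1|c__C1|o__O1|f__F1|g__G1|s__Species_B\t37.0",
    "k__Eukaryota|p__P2|c__C2|o__O2|f__F2|g__G2|s__Species_C\t3.0",
]
print(dict(species_by_kingdom(example)))
```

To use it on a real file, pass an open file handle, e.g. `species_by_kingdom(open(path))`. Rows below species level (strain or SGB rows, if the profile has them) are not counted because their last rank is not `s__`.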

Another side question: from my Kneaddata output, I ran blastn on some reads from the “bowtie2_paired_contam” files, and some of the hits are clearly to microorganisms. For example, one read hit Aspergillus flavus strain A9 chromosome 7 (100% query coverage, E-value 2e-69, 99.34% identity). Is this also typical? Maybe it’s a human-associated fungal species?

Hello William, With an input file of ~11 GB, and depending on the number of reads aligned in the nucleotide step, I don’t think 48 hours is unexpected. I would give it ~12 hours more, and if it has not completed by then you might double-check that it is not stuck.

The memory-use option takes one of two settings, [“minimum”, “maximum”], rather than a number of gigabytes. “minimum” is the default; “maximum” uses a lot more memory and less disk space. We always use the “minimum” mode, especially for large input files.
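As a small illustration, here is a Python sketch that assembles a humann command line and rejects a numeric memory setting. The flag names (`--input`, `--output`, `--threads`, `--memory-use`) are as I understand them from the HUMAnN 3 help text, so please confirm with `humann --help`; the output folder name `humann_out` and the helper `humann_command` are made up for the example, and the command is only printed, not run.

```python
import shlex

# The two settings named in the reply above.
MEMORY_MODES = ("minimum", "maximum")


def humann_command(input_fastq, output_dir, threads=24, memory_use="minimum"):
    """Build (but do not run) a humann command as an argument list."""
    if memory_use not in MEMORY_MODES:
        raise ValueError(
            f"--memory-use expects one of {MEMORY_MODES}, got {memory_use!r}"
        )
    return [
        "humann",
        "--input", input_fastq,
        "--output", output_dir,
        "--threads", str(threads),
        "--memory-use", memory_use,
    ]


# "humann_out" is a hypothetical output folder name.
cmd = humann_command("X_Kneaded.fastq", "humann_out")
print(shlex.join(cmd))
```

Passing something like `memory_use="700"` raises a `ValueError`, which reflects the point above: the option selects a mode rather than an amount of RAM.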

Yes, the taxonomic profile is included in the output folder.