Error in humann3 run

PA.02.log.txt (354.3 KB)
Hi, I am running humann3 for the first time. I am running it on a server with 20 threads, using this command:
for i in *.fastq; do humann --input "$i" --threads 20 --output ../humann3/"${i%.fastq}"humann3; done
I seem to get the *_genefamilies.tsv but then the run crashes. The temp folder has the following content:
Sep 14 13:03 PA.02_bowtie2_aligned.sam
Sep 14 13:05 PA.02_bowtie2_aligned.tsv
Sep 14 12:57 PA.02_bowtie2_index.1.bt2
Sep 14 12:57 PA.02_bowtie2_index.2.bt2
Sep 14 12:53 PA.02_bowtie2_index.3.bt2
Sep 14 12:53 PA.02_bowtie2_index.4.bt2
Sep 14 13:00 PA.02_bowtie2_index.rev.1.bt2
Sep 14 13:00 PA.02_bowtie2_index.rev.2.bt2
Sep 14 13:08 PA.02_bowtie2_unaligned.fa
Sep 14 12:53 PA.02_custom_chocophlan_database.ffn
Sep 14 14:52 PA.02_diamond_aligned.tsv
Sep 14 14:54 PA.02_diamond_unaligned.fa
Sep 14 14:56 PA.02.log
Sep 14 12:52 PA.02_metaphlan_bowtie2.txt
Sep 14 12:53 PA.02_metaphlan_bugs_list.tsv
drwx------. Sep 10 18:28 tmp76fgi6xd
drwx------. Sep 14 14:56 tmptiht92rj
I have attached the log file. It seems there might be a memory issue, but at the same time it appears to me that the program is not fully using the server resources?
I very much appreciate your help!
/Stef

Hello Stef, Thank you for the detailed post and for including the log. If a subprocess fails, HUMAnN will print system stats (memory, CPU) to provide additional information for debugging, even when the failure is not actually due to resources (this requires the Python package psutil to be installed). I agree with you; I don’t think the error is due to memory. It looks like the error is from MinPath. I have seen errors like this before when an older version of glpsol (included with glpk) is installed. If you upgrade your glpk install, I think it should resolve the errors that you are seeing.
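If it helps, here is one quick way to check which glpsol is being picked up and to upgrade it, assuming you manage the environment with conda (the conda-forge channel is just one option; use your system package manager otherwise):
which glpsol
glpsol --version
conda install -c conda-forge glpk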

Thank you,
Lauren

Thank you so much Lauren! I have upgraded now. So if I cancel my humann3 run (it has been running for two days) and restart with for i in *.fastq; do humann --input "$i" --threads 20 --output ../humann3/"${i%.fastq}"humann3; done
will it pick up where it crashed for all my files? Do I need to use any --bypass flag? Or do I need to delete all output and start all over again? I hope not, since it has been running for two days already and is not even halfway through.
One other concern: I compared the % unmapped in *_genefamilies.tsv and saw that the percentage is about twice the value from a default run of humann2 on the same data. Is this due to the new >50% subject coverage requirement, or was that threshold the same before? My samples are not extremely deeply sequenced - most have about 2 million reads, some just above 1 million - but they are from a MiSeq, thus 2x300 bp. Would you suggest lowering this coverage? Should I lower it to the same value, e.g. 30.0, for both --translated-subject-coverage-threshold and --nucleotide-subject-coverage-threshold?
Again, thank you for your help!
Kind regards,
Stef

Hi Stef, HUMAnN has a “resume” mode where it will start where it left off. Just add the option “--resume” and the software will not rerun any of the compute-intensive portions of the workflow, including the nucleotide and translated search steps. Alternatively, you can provide the gene families output file as input to HUMAnN and it will just generate the pathways files. We did add query coverage filtering for the nucleotide alignments, reduced the percent identity for the UniRef90 run mode, and modified our default DIAMOND search settings. I am not sure which of these would account for the change in unmapped reads that you are seeing. Let me double check with Eric, as I am not sure whether you should reduce the coverage thresholds based on read coverage.
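For example, resuming your existing loop would look something like the first line below (paths taken from your earlier command), and the second line is a sketch of the gene-families-as-input route for one sample (the output directory name is just an example):
for i in *.fastq; do humann --input "$i" --threads 20 --resume --output ../humann3/"${i%.fastq}"humann3; done
humann --input PA.02_genefamilies.tsv --output ../humann3/PA.02_pathways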

Thank you,
Lauren

Hi Stef, I checked in with Eric and he said that since your reads are long (2x300), they might tend to overhang genes. You could try turning off the nucleotide coverage filter, which is stricter in the latest version of HUMAnN, by setting the option “--nucleotide-subject-coverage-threshold 0”.
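For one of your samples that would look something like the following (the input file name is assumed from your log name):
humann --input PA.02.fastq --threads 20 --nucleotide-subject-coverage-threshold 0 --output ../humann3/PA.02humann3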

Thank you,
Lauren

Dear Lauren,
Thank you for helping me! I have now tried the following: for i in *.fastq; do humann --input "$i" --nucleotide-subject-coverage-threshold 0 --output ../../humann3/test/"${i%.fastq}"humann3; done
and unfortunately, the unmapped fraction in genefamilies.tsv is still 30%, twice as much as for humann2 on the same data. Do you have any further suggestions for fine-tuning?
/Stef

Hi Stef - Sorry, I can’t think of other options to change. Just to double check: are you using an equivalent database (e.g. UniRef90 for both the v2 and v3 runs)?

Thanks,
Lauren

Dear Lauren,
No, I updated the databases to the latest versions when moving to humann3, but then I would expect less unknown, since the database should include more sequences, right?

Hi Stef - Yes, the latest databases include more UniRef sequences. To determine the differences in unknown reads in your runs, have you looked at the total number of species identified, the percent of reads aligned in the nucleotide search, and the percent of reads aligned in the translated search? Comparing these values across a couple of samples from your runs of the different versions might help pin down where the difference in the unknown fraction comes from.
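Those summary values are written to each sample’s log, so a quick way to pull them out could be something like the lines below, run on the v2 and v3 logs side by side (the exact wording of the log lines can differ between versions, so adjust the patterns if they don’t match):
grep -i "species selected" PA.02.log
grep -i "unaligned reads after" PA.02.log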

Thank you,
Lauren

Hi Lauren-

While running humann3 with the ChocoPhlAn database, I am getting the following error:
CRITICAL ERROR: The directory provided for ChocoPhlAn contains files ( g__Phascolarctobacterium.s__Phascolarctobacterium_sp.centroids.v296_201901.ffn.gz ) that are not of the expected version. Please install the latest version of the database: 201901b

I am trying to update to the latest version but am getting this error:
CRITICAL ERROR: Unable to download and extract from URL: http://huttenhower.sph.harvard.edu/humann_data/chocophlan/full_chocophlan.v296_201901b.tar.gz

Can you please help me get the updated version of these databases so I can use them for the analysis? Thanks so much.

Hello - Sorry for the slow response. The URL you have is for the latest database. Can you please try the download again? Sometimes downloads fail due to the load on our server.
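If the download keeps failing, you could also retry with the bundled download utility (the install path below is just an example), or fetch the tarball directly from the URL in your error message and unpack it into your nucleotide database folder:
humann_databases --download chocophlan full /path/to/humann_databases
wget http://huttenhower.sph.harvard.edu/humann_data/chocophlan/full_chocophlan.v296_201901b.tar.gz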

Thank you,
Lauren