The bioBakery help forum

Humann3 error with metaphlan

Hi, I’m having trouble ever since humann3 updated.
Running into this error i’m having a hard time deciphering.

(h3) billyc59@nia-login03:~$ humann -i $read1 --nucleotide-database $chocophlan --protein-database $uniref --threads 40 -o $output
Output files will be written to: /scratch/j/jparkin/billyc59/kneaddata_run/h3/kimchi/66

Running metaphlan …

CRITICAL ERROR: Error executing: /gpfs/fs1/home/j/jparkin/billyc59/.virtualenvs/h3/bin/metaphlan /scratch/j/jparkin/billyc59/kneaddata_run/cleaned_kimchi/kimchi_SRR443366_kneaddata.fastq -t rel_ab -o /scratch/j/jparkin/billyc59/kneaddata_run/h3/kimchi/66/kimchi_SRR443366_kneaddata_humann_temp/kimchi_SRR443366_kneaddata_metaphlan_bugs_list.tsv --input_type fastq --bowtie2out /scratch/j/jparkin/billyc59/kneaddata_run/h3/kimchi/66/kimchi_SRR443366_kneaddata_humann_temp/kimchi_SRR443366_kneaddata_metaphlan_bowtie2.txt --nproc 40

Error message returned from metaphlan :
Warning! Biom python library not detected!
Exporting to biom format will not work!
Traceback (most recent call last):
File “/gpfs/fs1/home/j/jparkin/billyc59/.virtualenvs/h3/bin/read_fastx.py”, line 8, in
sys.exit(main())
File “/gpfs/fs1/home/j/jparkin/billyc59/.virtualenvs/h3/lib/python3.8/site-packages/metaphlan/utils/read_fastx.py”, line 167, in main
f_nreads, f_avg_read_length = read_and_write_raw(f, opened=False, min_len=min_len, prefix_id=prefix_id)
File “/gpfs/fs1/home/j/jparkin/billyc59/.virtualenvs/h3/lib/python3.8/site-packages/metaphlan/utils/read_fastx.py”, line 129, in read_and_write_raw
nreads, avg_read_length = read_and_write_raw_int(inf, min_len=min_len, prefix_id=prefix_id)
File “/gpfs/fs1/home/j/jparkin/billyc59/.virtualenvs/h3/lib/python3.8/site-packages/metaphlan/utils/read_fastx.py”, line 82, in read_and_write_raw_int
for record in parser(uio.StringIO("".join(r))):
File “/home/j/jparkin/billyc59/.local/lib/python3.8/site-packages/biopython-1.78-py3.8-linux-x86_64.egg/Bio/SeqIO/QualityIO.py”, line 945, in FastqGeneralIterator
raise ValueError(“Sequence and quality captions differ.”)
ValueError: Sequence and quality captions differ.

This appears to be an error that occurs in a misformated FASTQ file in which, for at least one of the reads, there are more nucleotide characters than matching quality characters. It frequently happens if a FASTQ file was being downloaded/generated and the process is interrupted, such that the file ends in the middle of writing the quality values for whatever read it was considering.

Can you check if this is the case for your input file?

to my knowledge, kneaddata processed this file with no issues.
I even built a script to check the integrity of the fastq file.
There were no issues with the fastq input. All reads’ sequence lengths were the same length as their quality indicators

Reformatting file sequence identifiers …

Initial number of reads ( /scratch/j/jparkin/billyc59/kneaddata_run/cleaned_human_oral/reformatted_identifierswguitvsz_9h_pair1 ): 30441053.0
Running Trimmomatic …
Total reads after trimming ( /scratch/j/jparkin/billyc59/kneaddata_run/cleaned_human_oral/9h_pair1_kneaddata.trimmed.fastq ): 29525893.0
Running trf …
Decontaminating …
Running bowtie2 …
Running bowtie2 …
Total reads after removing those found in reference database ( /scratch/j/jparkin/billyc59/kneaddata_run/cleaned_human_oral/9h_pair1_kneaddata__bowtie2_clean.fastq ): 29474966.0
Total reads after removing those found in reference database ( /scratch/j/jparkin/billyc59/kneaddata_run/cleaned_human_oral/9h_pair1_kneaddata_SILVA_128_LSUParc_SSUParc_ribosomal_RNA_bowtie2_clean.fastq ): 28549607.0
Total reads after merging results from multiple databases ( /scratch/j/jparkin/billyc59/kneaddata_run/cleaned_human_oral/9h_pair1_kneaddata.fastq ): 28543475.0

Final output file created:
/scratch/j/jparkin/billyc59/kneaddata_run/cleaned_human_oral/9h_pair1_kneaddata.fastq

Thanks for confirming. Have you tried running this on just a single thread? You could also try first running just the MetaPhlAn step on a single thread to see if that works successfully (since that is where you’re getting stuck now). We had another recent post where a person was having trouble running on a large number of threads, possibly because something was getting out of whack during sequence I/O (if something is interrupting the continuous reading of the FASTQ, it might manifest as a sequence len != quality string len error).

1 Like

tried several things:

  1. humann, limited it to 1 thread
  2. ran the metaphlan command humann produced
  3. ran the metaphlan command with --nproc 1

All 3 yielded the same problem as in the initial question.
Using metaphlan version 3.0.13.

I think I’m better off downgrading at this point. There was a version of humann3 that uses metaphlan 2. is that one still around?

@lauren.j.mciver caught that the error here is not because the sequence and quality strings differ in length (as I had surmised), but rather that their captions differ (thanks Lauren!). E.g. something like this:

>seq1
TTGAGAGCGCGAGAG
+seq1_qc
8fhdsahf8dfs329

Can you check if this is the case for your input file? The > and + captions can be the same OR the + caption can be empty, but they can’t otherwise differ.

1 Like

Thanks! That did the trick. I had to scrub all data after the “+” in the 3rd row of the fastq

When did this change come about? I’m trying to reproduce some data for publication purposes. This error wasn’t present in my last run of humann3

Here’s the simple script I made, in case someone else has the same issue.

import os
import sys
from datetime import datetime as dt

if __name__ == "__main__":
    print(dt.today(), "starting scrubbing")
    fastq_in_file = sys.argv[1]
    out_dir = sys.argv[2]
    out_name = os.path.basename(fastq_in_file)
    print("out:", out_name)
    with open(os.path.join(out_dir, out_name), "w") as out_file:
        with open (fastq_in_file, "r") as in_file:
            for line in in_file:
                if(line.startswith("+")):
                    out_file.write("+" + "\n")
                else:
                    out_file.write(line)
    
    print(dt.today(), "finished scrubbing")

This is a feature of the underlying Biopython fastq parser - it’s not something we specifically implemented / changed in the development of HUMAnN 3. This is the first time I’ve seen the issue come up - glad we were able to diagnose and solve it! :slight_smile:

might I recommend a change to kneaddata too? to add in this extra behaviour to scrub the “+” of tags