MetaPhlAn error after processing FASTQ with kneaddata v0.12.0

Hi,

I ran into the same problem described in the following thread.
Continuing the discussion from Humann3 error with metaphlan:

In my experience, the error occurs only when the reads are processed by kneaddata with the --unpaired flag. (It does not occur with the --input1 and --input2 flags.)

With --unpaired, kneaddata keeps the caption on the + line of each FASTQ record, whereas with --input1 and --input2 it removes that caption.

So I think this error is caused by how kneaddata behaves when it processes a single (unpaired) input FASTQ file.

Here are more details.

I downloaded FASTQ data from SRA and ran kneaddata v0.12.0.

Kneaddata Command:

kneaddata --unpaired ERR2238690.fastq.gz -db $BOWTIE2_DB_PATH \
--run-fastqc-start --run-fastqc-end  \
--remove-intermediate-output --output kneaddata_output

I got the kneaddata output file ERR2238690_kneaddata.fastq and then compressed it with pigz.

Thus, the final input file for MetaPhlAn is ERR2238690_kneaddata.fastq.gz.

When running MetaPhlAn version 4.0.3 (24 Oct 2022), I got the following error message:

metaphlan Command:

metaphlan ERR2238690_kneaddata.fastq.gz --input_type fastq \
-o metagenome_profile.txt --bowtie2out metagenome.bowtie2.bz2 \
--bowtie2db $DB_PATH

metaphlan Return:

Traceback (most recent call last):
  File "/home/baejw/anaconda3/envs/biobakery/bin/read_fastx.py", line 10, in <module>
    sys.exit(main())
  File "/home/baejw/anaconda3/envs/biobakery/lib/python3.8/site-packages/metaphlan/utils/read_fastx.py", line 168, in main
    f_nreads, f_avg_read_length = read_and_write_raw(f, opened=False, min_len=min_len, prefix_id=prefix_id)
  File "/home/baejw/anaconda3/envs/biobakery/lib/python3.8/site-packages/metaphlan/utils/read_fastx.py", line 130, in read_and_write_raw
    nreads, avg_read_length = read_and_write_raw_int(inf, min_len=min_len, prefix_id=prefix_id)
  File "/home/baejw/anaconda3/envs/biobakery/lib/python3.8/site-packages/metaphlan/utils/read_fastx.py", line 83, in read_and_write_raw_int
    for record in parser(uio.StringIO("".join(r))):
  File "/home/baejw/anaconda3/envs/biobakery/lib/python3.8/site-packages/Bio/SeqIO/QualityIO.py", line 950, in FastqGeneralIterator
    raise ValueError("Sequence and quality captions differ.")
ValueError: Sequence and quality captions differ.

I checked the MetaPhlAn input file ERR2238690_kneaddata.fastq.gz and found that its @ and + lines carry different caption strings.

Command:

zcat ERR2238690_kneaddata.fastq.gz | head

Return:

@ERR2238690.1.1.length=140#0/1
CTGCNGGCTACCATTTCGAAGCGTCAAAAGGTCTGGCAGATGACGACCATTTTGTCGGTCTGGCTATCGACGAAGACCGTCAGCCGGAACTGACCGCTGAACGTGTAGAAAAATGGGTTAAACAGATTTCTGAAGAGCTG
+ERR2238690.1 1 length=140
@DDB#<<DHHHGHEEHHEHIGHDCGE@HHHIEHHHIFHEHHIIIIHHHHIIEHHHEH<EHHH1FEHHIEFHGH?HGHIHIHCEHFCCHHIIGHIIHI<C<CCHCGF?EHHHII@?DHEEHIHIIGHIHHHICHIHIIEGH
@ERR2238690.2.2.length=150#0/1
GTGATGGATCGCTGCCCGGCTATTGAGATCCCTCGCCTGGGCCTGGCCAAATAAAAAATCCCCGGAAGGCAAAAACCTTCCGGGGATTTGTTCAGGGAATAGTTACGCAGACGCG
+ERR2238690.2 2 length=150
D<0<DCFCHGGHHIIIH@CHHIHIIHHFHIICHHIICEH?CHHHEHH@?G1DFGHHIHEE1FCHIIHGHHHEEG10D1<<1<<CH?0C1D<F1D11FC1<F<<<<1<E=/<C?/C
@ERR2238690.3.3.length=121#0/1
GATTGGTAATGCTCACATCGTCGATACGCTCGACGAAGCGTTAGCCGGTTGTAGTCTGGTGGTGGGCACCAGCGCTCCCGCACGCTGCCGTGGCCGATGCTCGACCCGCGCGAATGCGGTC

For example, the first read's captions are @ERR2238690.1.1.length=140#0/1 and +ERR2238690.1 1 length=140, which are different strings.

I found that the #0/1 suffix in the @ caption is not present in the original FASTQ file; kneaddata adds it (and also replaces the spaces in the caption with dots).

Captions in the originally downloaded file ERR2238690.fastq.gz:

Command:

zcat ERR2238690.fastq.gz | head

Return:

@ERR2238690.1 1 length=140
CTGCNGGCTACCATTTCGAAGCGTCAAAAGGTCTGGCAGATGACGACCATTTTGTCGGTCTGGCTATCGACGAAGACCGTCAGCCGGAACTGACCGCTGAACGTGTAGAAAAATGGGTTAAACAGATTTCTGAAGAGCTG
+ERR2238690.1 1 length=140
@DDB#<<DHHHGHEEHHEHIGHDCGE@HHHIEHHHIFHEHHIIIIHHHHIIEHHHEH<EHHH1FEHHIEFHGH?HGHIHIHCEHFCCHHIIGHIIHI<C<CCHCGF?EHHHII@?DHEEHIHIIGHIHHHICHIHIIEGH
@ERR2238690.2 2 length=150
GTGATGGATCGCTGCCCGGCTATTGAGATCCCTCGCCTGGGCCTGGCCAAATAAAAAATCCCCGGAAGGCAAAAACCTTCCGGGGATTTGTTCAGGGAATAGTTACGCAGACGCGGGGCCTGGAGTTGTTTGCGGATGGTCTGCGCCAGC
+ERR2238690.2 2 length=150
D<0<DCFCHGGHHIIIH@CHHIHIIHHFHIICHHIICEH?CHHHEHH@?G1DFGHHIHEE1FCHIIHGHHHEEG10D1<<1<<CH?0C1D<F1D11FC1<F<<<<1<E=/<C?/C//CEHHHE0CHH?DGFHHHH/ED0:DFHIDCD-?D
@ERR2238690.3 3 length=121
GATTGGTAATGCTCACATCGTCGATACGCTCGACGAAGCGTTAGCCGGTTGTAGTCTGGTGGTGGGCACCAGCGCTCCCGCACGCTGCCGTGGCCGATGCTCGACCCGCGCGAATGCGGTC
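
The mismatch shown above can be checked across a whole file with a short script. This is only a minimal sketch: it assumes strict four-line FASTQ records (no wrapped sequence lines), and the names find_caption_mismatches and check_file are hypothetical helpers, not part of kneaddata or MetaPhlAn. It applies the same rule that raises the ValueError in the traceback: a bare + line is accepted, but a non-empty + caption must equal the @ caption.

```python
import gzip
from itertools import islice


def find_caption_mismatches(handle, limit=5):
    """Return up to `limit` (record_number, at_caption, plus_caption) tuples.

    Assumes strict four-line FASTQ records (no wrapped sequence lines).
    """
    mismatches = []
    record_number = 0
    while len(mismatches) < limit:
        record = [line.rstrip("\r\n") for line in islice(handle, 4)]
        if not record:
            break  # end of input
        if len(record) < 4:
            raise ValueError("Truncated FASTQ record at end of input")
        record_number += 1
        at_caption = record[0][1:]    # text after '@'
        plus_caption = record[2][1:]  # text after '+', may be empty
        # A bare '+' line is fine; only a non-empty, different caption is a problem.
        if plus_caption and plus_caption != at_caption:
            mismatches.append((record_number, at_caption, plus_caption))
    return mismatches


def check_file(path, limit=5):
    """Open a plain or gzip-compressed FASTQ file and report mismatches."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as handle:
        return find_caption_mismatches(handle, limit=limit)
```

Calling check_file("ERR2238690_kneaddata.fastq.gz") would list the first few offending records; an empty list means no + caption conflicts with its @ caption.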

When I ran kneaddata v0.12.0 on paired-end files using --input1 and --input2 instead of --unpaired, the caption on the + line was removed and no error occurred when running MetaPhlAn.

Original fastq file ERR5947567.fastq.gz:

@ERR5947567.1 1 length=148
CTCCATTCACTTTACGAATAGCAATTTCTTTCGAACGTCGCTGTAATTNATCATTAATATAGCCTATCAGTCCCATGACAGTAATAAACAAGATGGCATTTGCCTCCAATAAAGTCGCGTTACGTAATACACGAACGGAATTATAAGA
+ERR5947567.1 1 length=148
,,FFFFFFFFFFFFFFFFFFFFFFFFF,FFFFFFFFFFF,:FFFFFF:#FFF::FF:,FFFFF:FFF,FF:,FFFF,F,:FFFFF:FFFFFF,F:FFFF,FFF,,FFFFFF,FF::,F,,,:FF,FF:,FF:FFFF::FF:FFFF:,F
@ERR5947567.2 2 length=151
CAGGCTGTTGCTAGTTCTTCTCGGAGGGCTTGTTGTCGCCCGACTGGAGCTTGCCGATACACTCGTCGTAATGGCTGTTGATATCGGCTATACGGTTGTCGTACGTCGAATAGTCGAAAGGCTGTGCGTCGGTGGCGCTGTCGATTTCCGT
+ERR5947567.2 2 length=151
F:FF:FFFFF,FFFFF:FF,F:FFFF:,,FFF:FFF:FFF,F,:FFFFFFF:F::::FF,,,FFFF:FFF,FFF::,FF,:::F,FF,F:F,,F,F,,:FFFF,FF,F:FFFFF:F:F,FF,FFFF:,F,,FFFF,F,FFFFFFFFF:,FF
@ERR5947567.3 3 length=150
CTGCCCACCCGGGTGATCCGGGGATCCTCCCGGATCTTGAACATTACGCCGCCCGCGATCTCATTTTGCGACAGCAGCTCTGACTTTCTCTCTTGCGCTTCCAGATACATGGTACGGACTGTATAGATCTGACATGTCCTGCCGTTTTTC

Kneaddata output file ERR5947567_1_kneaddata_paired_1.fastq.gz:

@ERR5947567.10000.10000.length=151#0/1
CACCATAAAAAGCAAGCACATTCTCGCCAGAACATTCCAAAGTTTCGGAATAGCAATGCTTTGCGTACTACTGGCAGCAAAGTAAAATTGATTTGTCTGGAGGGAGCGCAGTTTACCGCCTGCCCATTTCATGCTGTCTTCAAAAAATATC
+
FFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFF:FFFF:,FFFFFFFFFFFFFFF:F:FFFFFFFFF:FF:FF,FFFFFF:FFFFFF,FFFF:FF,FF::F:FFF,FFFFFFFFF:FFFFFFFF,FF::FFFFF:,FFFFFFFFFFFF,FFF
@ERR5947567.100000.100000.length=131#0/1
ATCCAGCAGTGTCACCTCCACGCCGTACTTCTTGGGCAGATACTCCCGAAACAGCAGGTTCGTGGCGCTGTAACACACCTCGGAGACGATGGCGTGGTCGCCGGAGTTCAGCGTGGTGGTAAACACGGCGA
+
FFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFF:FFFF:FFFFFFFFFFFF:FFFFFFFFF:FFFFFFF
@ERR5947567.1000000.1000000.length=151#0/1
GCAGAACAAATTGCATCTGGTGTATCAAATGGTTCTAGTATGAACCAGATAGCAGCATCTATTGTTTTGTCCCGAATAGCTCATGAGCGGACCCCACTGCATGCAATCACAACAGATTGGAAGGATGATGTTGTATTGGCATATACGTTTT
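
Since the paired-end output above has bare + lines, one possible interim workaround for the --unpaired output is to blank the + captions before passing the file to MetaPhlAn. The following is a minimal sketch under the same four-line-record assumption, not a definitive fix; strip_plus_captions and strip_file are hypothetical names introduced here for illustration.

```python
import gzip
from itertools import islice


def strip_plus_captions(in_handle, out_handle):
    """Copy FASTQ records, replacing every '+caption' line with a bare '+'.

    Assumes strict four-line records; returns the number of records written.
    """
    count = 0
    while True:
        record = list(islice(in_handle, 4))
        if not record:
            break  # end of input
        if len(record) < 4 or not record[2].startswith("+"):
            raise ValueError(f"Malformed FASTQ record {count + 1}")
        record[2] = "+\n"  # drop the caption, keep the separator line
        out_handle.writelines(record)
        count += 1
    return count


def strip_file(in_path, out_path):
    """Hypothetical helper: rewrite a gzip-compressed FASTQ file."""
    with gzip.open(in_path, "rt") as src, gzip.open(out_path, "wt") as dst:
        return strip_plus_captions(src, dst)
```

The @ lines, sequences, and quality strings are copied unchanged; only the third line of each record is rewritten.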

Could you please look into this behavior and fix it?

Best regards,