Hi,
I ran into the same problem as the one described in the thread linked below.
Continuing the discussion from Humann3 error with metaphlan:
In my experience, the error occurs only when the reads were processed by kneaddata with the --unpaired flag. (It does not occur with the --input1 and --input2 flags.)
The --unpaired flag keeps the caption on the + line of the fastq file, whereas the --input1 and --input2 flags remove it.
So I think this error is caused by how kneaddata behaves when it processes a single input fastq file.
Here are more details.
I downloaded fastq data from SRA and ran kneaddata v0.12.0.
Kneaddata Command:
kneaddata --unpaired ERR2238690.fastq.gz -db $BOWTIE2_DB_PATH \
--run-fastqc-start --run-fastqc-end \
--remove-intermediate-output --output kneaddata_output
I got the kneaddata output file ERR2238690_kneaddata.fastq and then compressed it using pigz.
Thus, the final input file for MetaPhlAn is ERR2238690_kneaddata.fastq.gz.
When running MetaPhlAn version 4.0.3 (24 Oct 2022), I got the following error message:
metaphlan Command:
metaphlan ERR2238690_kneaddata.fastq.gz --input_type fastq \
-o metagenome_profile.txt --bowtie2out metagenome.bowtie2.bz2 \
--bowtie2db $DB_PATH
metaphlan Return:
Traceback (most recent call last):
  File "/home/baejw/anaconda3/envs/biobakery/bin/read_fastx.py", line 10, in <module>
    sys.exit(main())
  File "/home/baejw/anaconda3/envs/biobakery/lib/python3.8/site-packages/metaphlan/utils/read_fastx.py", line 168, in main
    f_nreads, f_avg_read_length = read_and_write_raw(f, opened=False, min_len=min_len, prefix_id=prefix_id)
  File "/home/baejw/anaconda3/envs/biobakery/lib/python3.8/site-packages/metaphlan/utils/read_fastx.py", line 130, in read_and_write_raw
    nreads, avg_read_length = read_and_write_raw_int(inf, min_len=min_len, prefix_id=prefix_id)
  File "/home/baejw/anaconda3/envs/biobakery/lib/python3.8/site-packages/metaphlan/utils/read_fastx.py", line 83, in read_and_write_raw_int
    for record in parser(uio.StringIO("".join(r))):
  File "/home/baejw/anaconda3/envs/biobakery/lib/python3.8/site-packages/Bio/SeqIO/QualityIO.py", line 950, in FastqGeneralIterator
    raise ValueError("Sequence and quality captions differ.")
ValueError: Sequence and quality captions differ.
I checked the MetaPhlAn input file ERR2238690_kneaddata.fastq.gz and found that it has different caption strings on the @ and + lines.
Command:
$ zcat ERR2238690_kneaddata.fastq.gz | head
Return:
@ERR2238690.1.1.length=140#0/1
CTGCNGGCTACCATTTCGAAGCGTCAAAAGGTCTGGCAGATGACGACCATTTTGTCGGTCTGGCTATCGACGAAGACCGTCAGCCGGAACTGACCGCTGAACGTGTAGAAAAATGGGTTAAACAGATTTCTGAAGAGCTG
+ERR2238690.1 1 length=140
@DDB#<<DHHHGHEEHHEHIGHDCGE@HHHIEHHHIFHEHHIIIIHHHHIIEHHHEH<EHHH1FEHHIEFHGH?HGHIHIHCEHFCCHHIIGHIIHI<C<CCHCGF?EHHHII@?DHEEHIHIIGHIHHHICHIHIIEGH
@ERR2238690.2.2.length=150#0/1
GTGATGGATCGCTGCCCGGCTATTGAGATCCCTCGCCTGGGCCTGGCCAAATAAAAAATCCCCGGAAGGCAAAAACCTTCCGGGGATTTGTTCAGGGAATAGTTACGCAGACGCG
+ERR2238690.2 2 length=150
D<0<DCFCHGGHHIIIH@CHHIHIIHHFHIICHHIICEH?CHHHEHH@?G1DFGHHIHEE1FCHIIHGHHHEEG10D1<<1<<CH?0C1D<F1D11FC1<F<<<<1<E=/<C?/C
@ERR2238690.3.3.length=121#0/1
GATTGGTAATGCTCACATCGTCGATACGCTCGACGAAGCGTTAGCCGGTTGTAGTCTGGTGGTGGGCACCAGCGCTCCCGCACGCTGCCGTGGCCGATGCTCGACCCGCGCGAATGCGGTC
For example, the first read’s captions are @ERR2238690.1.1.length=140#0/1 and +ERR2238690.1 1 length=140, which are different strings.
I found that the #0/1 suffix in the @ caption is not in the original fastq file; it is added by kneaddata, which also replaces the spaces in the caption with dots.
Captions in the originally downloaded file ERR2238690.fastq.gz:
Command:
zcat ERR2238690.fastq.gz | head
Return:
@ERR2238690.1 1 length=140
CTGCNGGCTACCATTTCGAAGCGTCAAAAGGTCTGGCAGATGACGACCATTTTGTCGGTCTGGCTATCGACGAAGACCGTCAGCCGGAACTGACCGCTGAACGTGTAGAAAAATGGGTTAAACAGATTTCTGAAGAGCTG
+ERR2238690.1 1 length=140
@DDB#<<DHHHGHEEHHEHIGHDCGE@HHHIEHHHIFHEHHIIIIHHHHIIEHHHEH<EHHH1FEHHIEFHGH?HGHIHIHCEHFCCHHIIGHIIHI<C<CCHCGF?EHHHII@?DHEEHIHIIGHIHHHICHIHIIEGH
@ERR2238690.2 2 length=150
GTGATGGATCGCTGCCCGGCTATTGAGATCCCTCGCCTGGGCCTGGCCAAATAAAAAATCCCCGGAAGGCAAAAACCTTCCGGGGATTTGTTCAGGGAATAGTTACGCAGACGCGGGGCCTGGAGTTGTTTGCGGATGGTCTGCGCCAGC
+ERR2238690.2 2 length=150
D<0<DCFCHGGHHIIIH@CHHIHIIHHFHIICHHIICEH?CHHHEHH@?G1DFGHHIHEE1FCHIIHGHHHEEG10D1<<1<<CH?0C1D<F1D11FC1<F<<<<1<E=/<C?/C//CEHHHE0CHH?DGFHHHH/ED0:DFHIDCD-?D
@ERR2238690.3 3 length=121
GATTGGTAATGCTCACATCGTCGATACGCTCGACGAAGCGTTAGCCGGTTGTAGTCTGGTGGTGGGCACCAGCGCTCCCGCACGCTGCCGTGGCCGATGCTCGACCCGCGCGAATGCGGTC
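The mismatch between the @ and + captions can also be checked programmatically. Below is a minimal Python sketch, assuming exactly four lines per record (no wrapped sequences). The names find_caption_mismatches and open_fastq are made up for this illustration and are not part of kneaddata or MetaPhlAn. A bare + line is treated as valid, and a record is reported only when the + line carries a caption that differs from the @ caption, which is the condition the ValueError above complains about.

```python
import gzip
import io
from itertools import islice


def find_caption_mismatches(handle, limit=None):
    """Yield (record_number, at_caption, plus_caption) for mismatching records.

    Assumption: every record occupies exactly four lines.
    A bare "+" line (empty caption) is not reported.
    """
    found = 0
    record_number = 0
    while True:
        record = list(islice(handle, 4))
        if len(record) < 4:
            break  # end of file (or a truncated final record)
        record_number += 1
        at_caption = record[0].rstrip("\r\n")[1:]
        plus_caption = record[2].rstrip("\r\n")[1:]
        if plus_caption and plus_caption != at_caption:
            yield record_number, at_caption, plus_caption
            found += 1
            if limit is not None and found >= limit:
                break


def open_fastq(path):
    """Open a plain or gzip-compressed FASTQ file in text mode (hypothetical helper)."""
    if str(path).endswith(".gz"):
        return gzip.open(path, "rt")
    return open(path, "rt")


# In-memory example using the captions of the first kneaddata record
# shown earlier; sequence and quality strings are shortened for brevity.
example = io.StringIO(
    "@ERR2238690.1.1.length=140#0/1\n"
    "CTGC\n"
    "+ERR2238690.1 1 length=140\n"
    "@DDB\n"
)
mismatches = list(find_caption_mismatches(example))
print(mismatches)

# To scan a real file, something like this should work:
# with open_fastq("ERR2238690_kneaddata.fastq.gz") as handle:
#     for hit in find_caption_mismatches(handle, limit=5):
#         print(hit)
```

The limit argument only stops the scan early, so a large file does not have to be read in full just to confirm that the problem is present.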
When I ran kneaddata v0.12.0 on paired files using the --input1 and --input2 options instead of --unpaired, the caption on the + line was removed and no error occurred when running MetaPhlAn.
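Until this is addressed in kneaddata, one possible workaround is to blank the + captions in the single-end output before passing it to MetaPhlAn, so that the file has the same form as the paired-end output. This is only a sketch under the assumption of four lines per record; strip_plus_captions and strip_plus_captions_gz are hypothetical helper names, not functions from kneaddata or MetaPhlAn, and the output file name in the comment is just an example.

```python
import gzip
import io


def strip_plus_captions(src, dst):
    """Copy FASTQ text from src to dst, writing a bare "+" as the third
    line of every four-line record. Returns the number of complete records.

    Assumption: every record occupies exactly four lines.
    """
    n_lines = 0
    for n_lines, line in enumerate(src, start=1):
        if n_lines % 4 == 3:
            if not line.startswith("+"):
                raise ValueError(
                    f"line {n_lines}: expected '+' separator, got {line!r}"
                )
            dst.write("+\n")
        else:
            # Caption, sequence, and quality lines are copied unchanged.
            dst.write(line)
    return n_lines // 4


def strip_plus_captions_gz(in_path, out_path):
    """Same as strip_plus_captions, for gzip-compressed files."""
    with gzip.open(in_path, "rt") as src, gzip.open(out_path, "wt") as dst:
        return strip_plus_captions(src, dst)


# In-memory demonstration with two made-up records.
src = io.StringIO("@r1#0/1\nACGT\n+r1\nIIII\n@r2#0/1\nTTGA\n+r2\n@@II\n")
dst = io.StringIO()
n_records = strip_plus_captions(src, dst)
print(dst.getvalue())

# For a real file, for example:
# strip_plus_captions_gz("ERR2238690_kneaddata.fastq.gz",
#                        "ERR2238690_kneaddata.fixed.fastq.gz")
```

The + line is located by its position in the record instead of by its first character, because a quality line may itself start with + or @.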
Original fastq file ERR5947567.fastq.gz:
@ERR5947567.1 1 length=148
CTCCATTCACTTTACGAATAGCAATTTCTTTCGAACGTCGCTGTAATTNATCATTAATATAGCCTATCAGTCCCATGACAGTAATAAACAAGATGGCATTTGCCTCCAATAAAGTCGCGTTACGTAATACACGAACGGAATTATAAGA
+ERR5947567.1 1 length=148
,,FFFFFFFFFFFFFFFFFFFFFFFFF,FFFFFFFFFFF,:FFFFFF:#FFF::FF:,FFFFF:FFF,FF:,FFFF,F,:FFFFF:FFFFFF,F:FFFF,FFF,,FFFFFF,FF::,F,,,:FF,FF:,FF:FFFF::FF:FFFF:,F
@ERR5947567.2 2 length=151
CAGGCTGTTGCTAGTTCTTCTCGGAGGGCTTGTTGTCGCCCGACTGGAGCTTGCCGATACACTCGTCGTAATGGCTGTTGATATCGGCTATACGGTTGTCGTACGTCGAATAGTCGAAAGGCTGTGCGTCGGTGGCGCTGTCGATTTCCGT
+ERR5947567.2 2 length=151
F:FF:FFFFF,FFFFF:FF,F:FFFF:,,FFF:FFF:FFF,F,:FFFFFFF:F::::FF,,,FFFF:FFF,FFF::,FF,:::F,FF,F:F,,F,F,,:FFFF,FF,F:FFFFF:F:F,FF,FFFF:,F,,FFFF,F,FFFFFFFFF:,FF
@ERR5947567.3 3 length=150
CTGCCCACCCGGGTGATCCGGGGATCCTCCCGGATCTTGAACATTACGCCGCCCGCGATCTCATTTTGCGACAGCAGCTCTGACTTTCTCTCTTGCGCTTCCAGATACATGGTACGGACTGTATAGATCTGACATGTCCTGCCGTTTTTC
Kneaddata output file ERR5947567_1_kneaddata_paired_1.fastq.gz:
@ERR5947567.10000.10000.length=151#0/1
CACCATAAAAAGCAAGCACATTCTCGCCAGAACATTCCAAAGTTTCGGAATAGCAATGCTTTGCGTACTACTGGCAGCAAAGTAAAATTGATTTGTCTGGAGGGAGCGCAGTTTACCGCCTGCCCATTTCATGCTGTCTTCAAAAAATATC
+
FFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFF:FFFF:,FFFFFFFFFFFFFFF:F:FFFFFFFFF:FF:FF,FFFFFF:FFFFFF,FFFF:FF,FF::F:FFF,FFFFFFFFF:FFFFFFFF,FF::FFFFF:,FFFFFFFFFFFF,FFF
@ERR5947567.100000.100000.length=131#0/1
ATCCAGCAGTGTCACCTCCACGCCGTACTTCTTGGGCAGATACTCCCGAAACAGCAGGTTCGTGGCGCTGTAACACACCTCGGAGACGATGGCGTGGTCGCCGGAGTTCAGCGTGGTGGTAAACACGGCGA
+
FFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFF:FFFF:FFFFFFFFFFFF:FFFFFFFFF:FFFFFFF
@ERR5947567.1000000.1000000.length=151#0/1
GCAGAACAAATTGCATCTGGTGTATCAAATGGTTCTAGTATGAACCAGATAGCAGCATCTATTGTTTTGTCCCGAATAGCTCATGAGCGGACCCCACTGCATGCAATCACAACAGATTGGAAGGATGATGTTGTATTGGCATATACGTTTT
I was wondering if you could look into this and fix it.
Best regards,