Kneaddata FASTQ header problem

Version 0.7.5 of kneaddata does not handle FASTQ headers as seamlessly as earlier versions. With paired FASTQ files in which the headers of both read_1.fastq and read_2.fastq are

@SRR7280791.1 1 length=101
@SRR7280791.2 2 length=101
@SRR7280791.3 3 length=101

kneaddata reports different read counts for the paired output files kneaddata_paired_1.fastq and kneaddata_paired_2.fastq. After massaging the headers to

@SRR7280791.1/1
@SRR7280791.2/1
@SRR7280791.3/1

for read_1.fastq and

@SRR7280791.1/2
@SRR7280791.2/2
@SRR7280791.3/2

for read_2.fastq, kneaddata reports identical read counts for the paired output files.
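
For anyone who wants to script the header change described above, here is a minimal sketch in Python. It is not part of kneaddata; the function names `add_mate_suffix` and `rewrite_headers` are made up for illustration, and it assumes plain four-line FASTQ records (no wrapped sequence lines). It keeps only the first whitespace-delimited token of each header and appends /1 or /2. It also resets the separator line to a bare "+", since SRA downloads sometimes repeat the old header there.

```python
import re


def add_mate_suffix(header, mate):
    """Reduce a FASTQ header to its read ID and append /1 or /2."""
    if not header.startswith("@"):
        raise ValueError(f"not a FASTQ header: {header!r}")
    read_id = header.split()[0]  # drops e.g. " 1 length=101"
    if re.search(r"/[12]$", read_id):
        return read_id  # already suffixed, leave unchanged
    return f"{read_id}/{mate}"


def rewrite_headers(lines, mate):
    """Yield FASTQ lines with rewritten headers (assumes 4 lines per record)."""
    for index, line in enumerate(lines):
        position = index % 4
        if position == 0:
            yield add_mate_suffix(line.rstrip("\n"), mate) + "\n"
        elif position == 2:
            yield "+\n"  # a repeated old header here would no longer match
        else:
            yield line


if __name__ == "__main__":
    # File names follow the first post; the output names are hypothetical.
    for mate in (1, 2):
        with open(f"read_{mate}.fastq") as src, \
                open(f"read_{mate}.renamed.fastq", "w") as dst:
            dst.writelines(rewrite_headers(src, mate))
```

The renamed files can then be passed to kneaddata in place of the originals.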

Hi, I found a similar problem in my project.
The paired-1 and paired-2 output files have the same read counts, but the order of their headers (IDs) is clearly different.

paired-1

paired-2

This is a serious problem because it will cause errors in assembly and other downstream workflows.
If I reorder the IDs in the paired-1 and paired-2 files generated by kneaddata, is that a valid fix?

Hi, thank you for the detailed post. Kneaddata does not require the pairs to be ordered. The other tools in the bioBakery suite (e.g., HUMAnN, MetaPhlAn, StrainPhlAn, PanPhlAn) do not require ordered reads either.

Thank you,
Lauren

Thank you. Indeed, the bioBakery suite does not require the reads to be ordered. But assemblers that take paired reads, such as megahit, might need the headers in the two files to match. If they do not, the assembler could produce a different assembly.

Before running kneaddata, the raw paired reads had the same counts and the same order. However, after the paired reads were processed with kneaddata (v0.7.10), the clean pair counts were the same but the ID order of the clean pairs was different.

Chauncey
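
To confirm whether two paired output files really are out of order, a quick check along these lines may help. This is only a sketch in Python, not something kneaddata provides; `read_ids` and `first_order_mismatch` are hypothetical names, and the code assumes four-line FASTQ records and that a trailing /1 or /2 is the only difference between mate IDs.

```python
import re
from itertools import zip_longest


def read_ids(lines):
    """Yield the read ID of every FASTQ record, without '@' or a /1, /2 suffix."""
    for index, line in enumerate(lines):
        if index % 4 == 0 and line.strip():
            token = line.split()[0]
            yield re.sub(r"/[12]$", "", token[1:])


def first_order_mismatch(lines_1, lines_2):
    """Return (record_index, id_1, id_2) for the first differing pair, else None.

    If one file has fewer records, the missing ID is reported as None.
    """
    pairs = zip_longest(read_ids(lines_1), read_ids(lines_2))
    for record_index, (id_1, id_2) in enumerate(pairs):
        if id_1 != id_2:
            return record_index, id_1, id_2
    return None


if __name__ == "__main__":
    # Output file names as reported earlier in this thread.
    with open("kneaddata_paired_1.fastq") as f1, \
            open("kneaddata_paired_2.fastq") as f2:
        print(first_order_mismatch(f1, f2))
```

A result of `None` means every record lines up with its mate; anything else gives the zero-based record index where the two files first diverge.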

Hi Chauncey, thank you for the follow-up note. Good point: assembly-based methods will require the reads in each pair to be in the same order. If you run kneaddata with the option --reorder, it should keep the read pairs in the same order as the original input files. We did not enable this by default because it increases the runtime and we don't always need the pairs to be ordered.

Thank you,
Lauren

Thank you very much.

Hi Lauren. I am struggling because the headers have been a bit problematic for me. I download the runs from the SRA and get headers like this:

forward
@SRRxxxx.1 1 length=151
@SRRxxxx.2 2 length=151
@SRRxxxx.3 3 length=151

reverse
@SRRxxxx.1 1 length=151
@SRRxxxx.2 2 length=151
@SRRxxxx.3 3 length=151

I am using kneaddata like this:

  kneaddata --threads xxx \
    --input SRRxxx_1.fastq --input SRRxxx_2.fastq \
    --output out \
    --reference-db databases/human_genome_index/ \
    --trimmomatic-options "ILLUMINACLIP:adapters.fa:2:30:10 SLIDINGWINDOW:4:20 MINLEN:50" \
    --trimmomatic /data/ \
    --bowtie2-options "--very-sensitive --dovetail"

I get a different number of reads for the forward and reverse files, probably because of the read identifier issue. But I also can't have mate-specific read identifiers, because I use bwa down the line. Do you have any ideas on how to format the headers for preprocessing, and then how to change the headers back so I can use bwa, which requires identical identifiers for paired-end reads?

I would appreciate any guidance, as I am truly clueless.

Hi - Thank you for your detailed post and sorry for any confusion about the sequence identifiers. We are working on the next release of Kneaddata which will handle the case of read pairs that do not have the expected identifiers for internal tracking.

Since the sequence identifiers of the forward and reverse reads are the same, I would recommend adding a “/1” and “/2” to the ends of the forward and reverse identifiers, respectively. This way they should be tracked correctly and also not appear as the same read when run through bowtie2. Sorry for this inconvenience, and please know you will not have to make this manual change with the next release of Kneaddata.

I am not familiar with bwa disallowing specific identifier formats. If you post what specifically it does not allow, I can help debug from there.

Thank you,
Lauren
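
In case a downstream tool does turn out to need identical identifiers in both files, the /1 and /2 suffixes can be removed again after kneaddata has finished. The following Python sketch shows one way to do that. Whether bwa actually needs this step is not established in this thread. The names `strip_mate_suffix` and `strip_suffixes` are invented for illustration, and four-line FASTQ records are assumed.

```python
import re

MATE_SUFFIX = re.compile(r"/[12]$")


def strip_mate_suffix(header):
    """Remove a trailing /1 or /2 from the read ID of a FASTQ header line."""
    read_id, _, rest = header.rstrip("\n").partition(" ")
    read_id = MATE_SUFFIX.sub("", read_id)
    # Keep any description after the first space untouched
    return f"{read_id} {rest}" if rest else read_id


def strip_suffixes(lines):
    """Yield FASTQ lines with mate suffixes removed from header lines only."""
    for index, line in enumerate(lines):
        if index % 4 == 0:
            yield strip_mate_suffix(line) + "\n"
        else:
            yield line


if __name__ == "__main__":
    # Input names as reported earlier in this thread; output names are hypothetical.
    for mate in (1, 2):
        with open(f"kneaddata_paired_{mate}.fastq") as src, \
                open(f"kneaddata_paired_{mate}.stripped.fastq", "w") as dst:
            dst.writelines(strip_suffixes(src))
```

Only header lines are touched, so quality strings that happen to end in "/1" or "/2" are left alone.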