I had a question regarding using kneaddata (v0.7.10) on paired end data sequenced using a NovaSeq. Your tool works very nicely (thanks for designing it!) for sequence data with standard Phred scores, but I have issues when working with aggregated Phred scores (in my case only 4 score bins for indicating quality of a base (2,12,23,37)), see the figure below:
While the tool still seems to run fine, I get completely empty cleaned files (the ones ending in kneaddata_paired_1.fastq). I guess something goes wrong in the trimming due to the quality scores the tool perhaps does not expect, although I am not sure? Hope my question is clear, if not, please let me know. Looking forward to hearing from you!
Hi, I was wondering whether you have had time to look at this issue already, or whether I just missed something simple here?
Thanks a lot in advance,
Hi Quinten, Thank you for the ping on the original post and sorry for the slow response on our end. I am not sure exactly where the reads are being filtered in your run. Would you look for the read counts in your log file or if it is easier if you would post the contents of your log file I would be happy to check it out and see what might be up. My guess is trimmomatic, the tool kneaddata uses for trimming and adapter trimming might not be compatible with the new NovaSeq Phred scores.
No worries, thanks a lot for taking the time.
I have added log files for 3 samples, hopefully they can provide some clarity. I was thinking the same thing, but still it would be somewhat strange I suppose. Considering that the Phred score designator are no changed, there are just only 4 of them left. Perhaps there is some expectation in amount of variation of the quality scores by trimmomatic or something.
I can also point you to the raw sequence files if needed, I uploaded them to ENA already.
Best and thanks a lot in advance,
(Attachment H75MHDSXY_103997-001-006_TGTGCGTT-TCGTCTCA_L001_R1_kneaddata.log is missing)
(Attachment H75MHDSXY_103997-001-007_TTGATCCG-GTGAGCTT_L001_R1_kneaddata.log is missing)
(Attachment H75MHDSXY_103997-001-008_ATGGTTGC-AGTTGGCT_L001_R1_kneaddata.log is missing)
Hi - Thank you for posting the logs. It looks like the reads pass through the trimmomatic step okay. However, it looks like kneaddata gets confused with the pair identifiers because all filtered reads look to be placed in the final orphan files. The final pair files are empty. Looking at the format for your input sequences (thank you again for posting all the details) I don’t think kneaddata is picking up the pair identifiers in the sequence identifier lines. Would you possibly try adding “/1” and “/2” to the ends of the sequence ids in the read1 and read2 files, respectively. I think then kneaddata will be able to track the paired reads and you should have an expected set of read1, read2, orphan1, orphan2 for your output files. Sorry for the inconvenience on adding this change to your sequence identifier lines. In the next version of kneaddata it should be able to make this change for you automatically.
Thanks a lot for taking the time to check the log files, much appreciated. So it has nothing to do with the aggregated quality scores, good to know.
Could you give an example on how to do that perhaps? As I am not a 100% sure what you mean.
No worries, any chance that you know when the next version of kneaddata will be released?
Hi Quinten, Here are the two commands that you could use to add an id to the end of the sequence ids. The next version of kneaddata is currently under development. I am sorry but I am not sure of a release date yet but hopefully in the next few weeks.
$ sed '1~4 s/$/\/1/g' < original_R1.fastq > new_R1.fastq
$ sed '1~4 s/$/\/2/g' < original_R2.fastq > new_R2.fastq
Thanks a lot, I’ll try out the commands.
I’ll be checking the Github page then in the coming weeks, looking forward!