Problem with NovaSeq sequenced data

Quinten · September 16, 2020, 8:50am

Dear,

I had a question regarding using kneaddata (v0.7.10) on paired end data sequenced using a NovaSeq. Your tool works very nicely (thanks for designing it!) for sequence data with standard Phred scores, but I have issues when working with aggregated Phred scores (in my case only 4 score bins for indicating quality of a base (2,12,23,37)), see the figure below:

While the tool still seems to run fine, I get completely empty cleaned files (the ones ending in kneaddata_paired_1.fastq). I guess something goes wrong in the trimming due to the quality scores the tool perhaps does not expect, although I am not sure? Hope my question is clear, if not, please let me know. Looking forward to hearing from you!

Best,
Quinten

Quinten · October 15, 2020, 9:46am

Hi, I was wondering whether you have had time to look at this issue already, or whether I just missed something simple here?

Thanks a lot in advance,
Quinten

lauren.j.mciver · October 15, 2020, 7:27pm

Hi Quinten, Thank you for the ping on the original post and sorry for the slow response on our end. I am not sure exactly where the reads are being filtered in your run. Would you look for the read counts in your log file or if it is easier if you would post the contents of your log file I would be happy to check it out and see what might be up. My guess is trimmomatic, the tool kneaddata uses for trimming and adapter trimming might not be compatible with the new NovaSeq Phred scores.

Thank you,
Lauren

Quinten · October 16, 2020, 9:31am

Hi Lauren,

No worries, thanks a lot for taking the time.

I have added log files for 3 samples, hopefully they can provide some clarity. I was thinking the same thing, but still it would be somewhat strange I suppose. Considering that the Phred score designator are no changed, there are just only 4 of them left. Perhaps there is some expectation in amount of variation of the quality scores by trimmomatic or something.

I can also point you to the raw sequence files if needed, I uploaded them to ENA already.

Best and thanks a lot in advance,

Quinten

(Attachment H75MHDSXY_103997-001-006_TGTGCGTT-TCGTCTCA_L001_R1_kneaddata.log is missing)

(Attachment H75MHDSXY_103997-001-007_TTGATCCG-GTGAGCTT_L001_R1_kneaddata.log is missing)

(Attachment H75MHDSXY_103997-001-008_ATGGTTGC-AGTTGGCT_L001_R1_kneaddata.log is missing)

Quinten · October 16, 2020, 9:43am

H75MHDSXY_103997-001-008_ATGGTTGC-AGTTGGCT_L001_R1_kneaddata.log.txt (23.0 KB) H75MHDSXY_103997-001-007_TTGATCCG-GTGAGCTT_L001_R1_kneaddata.log.txt (23.0 KB)

Here with the files, wasn’t aware the .log file wouldn’t be accepted. Sorry for that

lauren.j.mciver · October 16, 2020, 9:53pm

Hi - Thank you for posting the logs. It looks like the reads pass through the trimmomatic step okay. However, it looks like kneaddata gets confused with the pair identifiers because all filtered reads look to be placed in the final orphan files. The final pair files are empty. Looking at the format for your input sequences (thank you again for posting all the details) I don’t think kneaddata is picking up the pair identifiers in the sequence identifier lines. Would you possibly try adding “/1” and “/2” to the ends of the sequence ids in the read1 and read2 files, respectively. I think then kneaddata will be able to track the paired reads and you should have an expected set of read1, read2, orphan1, orphan2 for your output files. Sorry for the inconvenience on adding this change to your sequence identifier lines. In the next version of kneaddata it should be able to make this change for you automatically.

Thank you,
Lauren

Quinten · October 18, 2020, 1:48pm

Hi Lauren,

Thanks a lot for taking the time to check the log files, much appreciated. So it has nothing to do with the aggregated quality scores, good to know.

Could you give an example on how to do that perhaps? As I am not a 100% sure what you mean.

No worries, any chance that you know when the next version of kneaddata will be released?

Best,

Quinten

lauren.j.mciver · October 19, 2020, 8:23pm

Hi Quinten, Here are the two commands that you could use to add an id to the end of the sequence ids. The next version of kneaddata is currently under development. I am sorry but I am not sure of a release date yet but hopefully in the next few weeks.

$ sed '1~4 s/$/\/1/g' < original_R1.fastq > new_R1.fastq
$ sed '1~4 s/$/\/2/g' < original_R2.fastq > new_R2.fastq

Thank you,
Lauren

Quinten · October 20, 2020, 7:10am

Hi Lauren,

Thanks a lot, I’ll try out the commands.

I’ll be checking the Github page then in the coming weeks, looking forward!

Best,

Quinten

Topic		Replies	Views
Kneaddata MINLEN behaviour unexpected; Adapters remain unremoved KneadData	0	363	March 30, 2022
Updated kneaddata to fix issue with paired-end reads? KneadData	9	1502	October 12, 2023
Kneaddata trim options, file headers, and consistent options flags KneadData	0	659	January 13, 2022
Kneaddata paired output have different names KneadData	0	27	September 19, 2024
Paired input to kneaddata resulting in unpaired files KneadData	1	355	June 29, 2024

Problem with NovaSeq sequenced data

Related topics