Questions about the read count table pulled from kneaddata logs

Hi! I used Kneaddata 0.10.0 to perform quality control on our paired-end fastq reads, and I used the kneaddata_read_count_table command to pull the information from all the kneaddata logs so I could check the read count at each filtering step for every sample.
In the read count table, there are read counts for:
raw pair1, raw pair2
trimmed pair1, trimmed pair2
trimmed orphan1, trimmed orphan2
decontaminated mouse_C57BL_6NJ pair1, decontaminated mouse_C57BL_6NJ pair2
decontaminated mouse_C57BL_6NJ orphan1, decontaminated mouse_C57BL_6NJ orphan2
final pair1, final pair2
final orphan1, final orphan2

What’s the difference between the pair and orphan columns? Should I look at the final pair1 and final pair2 columns for the final cleaned fastqs?

I also have a question about the --cat-final-output option.

Should we expect the concatenated fastq to contain exactly the same reads as pair1 plus pair2? How are the paired-end reads joined to generate the concatenated final output? For example, if final pair1 = 100001 reads and final pair2 = 100001 reads, should the concatenated fastq have 200002 reads?
In my data, the concatenated fastqs have slightly more reads than the sum of pair1 and pair2. Were some additional sequences used to join the reads?
Any input would be much appreciated!
Thank you so much!!

Best,
Fangxi

Hi,
I think I figured this out so I will just reply to my own post.

The total number of reads in the concatenated paired-end final fastq is:
reads in final pair 1 + reads in final pair 2 + unmatched 1 + unmatched 2
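
For anyone who wants to check this on their own data, here is a minimal Python sketch. It assumes uncompressed fastq files with exactly four lines per read. The function names and the four path arguments are my own placeholders, not part of Kneaddata, so pass in whatever your paired and unmatched output files are actually called.

```python
def count_fastq_reads(path):
    """Count reads in an uncompressed FASTQ file, assuming 4 lines per read."""
    with open(path) as handle:
        n_lines = sum(1 for _ in handle)
    if n_lines % 4 != 0:
        raise ValueError(f"{path}: {n_lines} lines is not a multiple of 4")
    return n_lines // 4


def expected_concatenated_reads(paired_1, paired_2, unmatched_1, unmatched_2):
    """Sum the read counts of the four final files (hypothetical helper)."""
    files = (paired_1, paired_2, unmatched_1, unmatched_2)
    return sum(count_fastq_reads(f) for f in files)
```

If the result matches the read count of the concatenated fastq, the extra reads are fully explained by the two unmatched files.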

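The final pair 1 and pair 2 reads are the ones that pass Situation 1 as described in the Kneaddata tutorial:

  1. Both reads in the pair pass.

The unmatched (orphan) reads are the ones where only one read of the pair passes, which is why the concatenated fastq has more reads than pair 1 plus pair 2.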

So we should concatenate the final outputs for further analysis, especially for HUMAnN, which takes a single input file.
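
If the run was done without --cat-final-output, the files can still be joined afterwards. Below is a rough sketch of that step in Python. The function name is hypothetical, and it assumes uncompressed fastq files that each end with a newline; otherwise the last read of one file would run into the first read of the next.

```python
import shutil


def concatenate_fastqs(input_paths, output_path):
    """Append each input file to output_path, in the order given."""
    with open(output_path, "wb") as out:
        for path in input_paths:
            with open(path, "rb") as src:
                # Stream the bytes across without loading the file into memory.
                shutil.copyfileobj(src, out)
```

The read order in the output simply follows the order of the input list; no reads are merged or modified.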

Thank you!

Best,
Fangxi
