The bioBakery help forum

Higher number of reads after trimmed + contaminated step cf. raw reads?

Hi, I just wanted to ask why, when I run kneaddata on my paired-end samples, there are more reads in decontaminated human.index pair1 and decontaminated human.index pair2 than in raw pair1 and raw pair2.

In addition, why do decontaminated human.index pair1 and decontaminated human.index pair2 not have equal numbers of reads? That is, why is there a discrepancy between them after the Bowtie2 step that removes contaminants?

Many thanks!

Sample: ERRXXXX_1_kneaddata

Column                                      Reads
raw pair1                                15181542
raw pair2                                15181542
trimmed pair1                            12435116
trimmed pair2                            12435116
trimmed orphan1                           1937860
trimmed orphan2                            309404
decontaminated human.index pair1         23688722
decontaminated human.index pair2          2711437
decontaminated human.index orphan1         716247
decontaminated human.index orphan2              8
final pair1                              23688722
final pair2                               2711437
final orphan1                              716247
final orphan2                                   8

Hi, thank you for the detailed post, and sorry for the confusion with the read counts. I think kneaddata is having trouble tracking the read pairs because of the format of the sequence identifiers. The total number of raw reads and the total number of reads after decontamination look as expected, but I agree with you that the numbers for the individual pairs are unexpected.

Would you check the first few lines of your input files and review the format of the sequence identifier? If it does not include the pair number, you would need to add it so that kneaddata can track the pairs. If it does include the pair number, changing it to one of the two formats kneaddata currently expects should resolve the issue you are seeing. We will also make a note to look at making kneaddata a bit more flexible with sequence identifiers in the future.
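As a quick sanity check on the totals, here is a minimal sketch that sums the counts from the table in the question for each step. The dictionary and the helper name `total` are only for illustration and are not part of kneaddata.

```python
# Read counts copied from the table in the question.
counts = {
    "raw pair1": 15181542,
    "raw pair2": 15181542,
    "trimmed pair1": 12435116,
    "trimmed pair2": 12435116,
    "trimmed orphan1": 1937860,
    "trimmed orphan2": 309404,
    "decontaminated human.index pair1": 23688722,
    "decontaminated human.index pair2": 2711437,
    "decontaminated human.index orphan1": 716247,
    "decontaminated human.index orphan2": 8,
}


def total(prefix):
    """Sum every count whose column name starts with the given prefix."""
    return sum(n for name, n in counts.items() if name.startswith(prefix))


raw_total = total("raw")
trimmed_total = total("trimmed")
decontaminated_total = total("decontaminated")

for label, value in [
    ("raw", raw_total),
    ("trimmed", trimmed_total),
    ("decontaminated", decontaminated_total),
]:
    print(f"{label:>15}: {value}")
```

The totals only shrink from one step to the next (30363084 raw, 27117496 trimmed, 27116414 decontaminated), so no reads are being created. It is only the split between the pair and orphan columns that looks wrong.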

Two Illumina formats:

@EAS139:136:FC706VJ:2:2104:15343:197393 1:Y:18:ATCACG
@EAS139:136:FC706VJ:2:2104:15343:197393 2:Y:18:ATCACG

or (this format is flexible; it only requires the IDs to end in 1 and 2)
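To see which case your files fall into, a minimal sketch like the one below could print the first few sequence identifiers along with the pair number it finds. The two regular expressions are my own reading of the two formats described above, not kneaddata's actual parsing code, and the function names are hypothetical. It assumes standard four-line FASTQ records, optionally gzipped.

```python
import gzip
import itertools
import re
import sys

# Hypothetical patterns written from the description above,
# not taken from the kneaddata source.
# Format 1: "@<id> 1:Y:18:ATCACG", pair number right after the space.
COMMENT_STYLE = re.compile(r"^@\S+\s+([12]):")
# Format 2: the identifier itself simply ends in 1 or 2.
SUFFIX_STYLE = re.compile(r"^@\S*([12])$")


def pair_number(header):
    """Return "1" or "2" if the header carries a pair number, else None."""
    header = header.strip()
    for pattern in (COMMENT_STYLE, SUFFIX_STYLE):
        match = pattern.match(header)
        if match:
            return match.group(1)
    return None


def first_headers(path, n=4):
    """Return the header lines of the first n records in a FASTQ file."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as handle:
        # Each FASTQ record spans four lines; the header is the first one.
        return [line.rstrip("\n") for line in itertools.islice(handle, 0, 4 * n, 4)]


if __name__ == "__main__":
    # Pass the FASTQ file paths as command-line arguments.
    for path in sys.argv[1:]:
        for header in first_headers(path):
            print(pair_number(header), header, sep="\t")
```

If the first column prints None for your reads, or the same number for both input files, the identifiers probably need to be rewritten before running kneaddata. Note that the second pattern is deliberately loose, so an identifier that just happens to end in 1 or 2 would also match.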


Thank you,