[wmgx workflow] Discrepancy in total read counts between kneaddata and humann

Hello,

I’m trying biobakery (latest docker image).
I would like to understand how the read count information in output/kneaddata/merged/kneaddata_read_count_table.tsv and in output/humann/counts/humann_read_and_species_count_table.tsv compare to each other.

Example from one of my sample:
kneaddata_read_count_table.tsv
|raw pair1|250000|
|raw pair2|250000|
|trimmed pair1|230495|
|trimmed pair2|230495|
|trimmed orphan1|7075|
|trimmed orphan2|8162|
|decontaminated hg37dec_v0.1 pair1|228858|
|decontaminated hg37dec_v0.1 pair2|228858|
|decontaminated hg37dec_v0.1 orphan1|7064|
|decontaminated hg37dec_v0.1 orphan2|8104|
|final pair1|228858|
|final pair2|228858|
|final orphan1|7064|
|final orphan2|8104|
and in humann_read_and_species_count_table.tsv:
|humann_total_reads|470978|

If I do final pair1 + final pair2 + final orphan1 + final orphan2, I get 472884, which is different (higher) than what humann reports. If I exclude orphans, I only get 457716, which is still different (lower) than what humann reports.

Any hint of where this could stem from?

Thanks again,
Theo

Hello,

I haven’t found any answer for that particular question yet. Could you please give me a clue?

Thanks in advance,
Cheers,
Theo

Hello everyone,

I finally found the reason!

As you might have noticed, the numbers for decontaminated hg37dec_v0.1 pair1 is identical to final pair1. Same for pair2, and for orphans.
It seems kneaddata does not put the correct number of reads for these categories (i.e. after TRF processing). If I look into KneadData log for this particular sample , I can see that 893 reads have been removed from pair1 file, 941 from pair2, 33 from unmatched1 and 39 from unmatched2.

If I do 472884-(893+941+33+39), I find back the number reported by HUMAnN.

Conclusion: do not trust read number for final files in output/kneaddata/merged/kneaddata_read_count_table.tsv

This is for KneadData 0.7.6