[wmgx workflow] Discrepancy in total read counts between kneaddata and humann

SciLiciumTheo · October 6, 2022, 3:48pm

Hello,

I’m trying biobakery (latest docker image).
I would like to understand how the read count information in output/kneaddata/merged/kneaddata_read_count_table.tsv and in output/humann/counts/humann_read_and_species_count_table.tsv compare to each other.

Example from one of my sample:
kneaddata_read_count_table.tsv
|raw pair1|250000|
|raw pair2|250000|
|trimmed pair1|230495|
|trimmed pair2|230495|
|trimmed orphan1|7075|
|trimmed orphan2|8162|
|decontaminated hg37dec_v0.1 pair1|228858|
|decontaminated hg37dec_v0.1 pair2|228858|
|decontaminated hg37dec_v0.1 orphan1|7064|
|decontaminated hg37dec_v0.1 orphan2|8104|
|final pair1|228858|
|final pair2|228858|
|final orphan1|7064|
|final orphan2|8104|
and in humann_read_and_species_count_table.tsv:
|humann_total_reads|470978|

If I do final pair1 + final pair2 + final orphan1 + final orphan2, I get 472884, which is different (higher) than what humann reports. If I exclude orphans, I only get 457716, which is still different (lower) than what humann reports.

Any hint of where this could stem from?

Thanks again,
Theo

SciLiciumTheo · October 17, 2022, 8:24am

Hello,

I haven’t found any answer for that particular question yet. Could you please give me a clue?

Thanks in advance,
Cheers,
Theo

SciLiciumTheo · October 27, 2022, 4:15pm

Hello everyone,

I finally found the reason!

As you might have noticed, the numbers for decontaminated hg37dec_v0.1 pair1 is identical to final pair1. Same for pair2, and for orphans.
It seems kneaddata does not put the correct number of reads for these categories (i.e. after TRF processing). If I look into KneadData log for this particular sample , I can see that 893 reads have been removed from pair1 file, 941 from pair2, 33 from unmatched1 and 39 from unmatched2.

If I do 472884-(893+941+33+39), I find back the number reported by HUMAnN.

Conclusion: do not trust read number for final files in output/kneaddata/merged/kneaddata_read_count_table.tsv

This is for KneadData 0.7.6

franzosa · December 15, 2022, 3:35pm

Thanks for the follow-up and apologies that we missed this - when posts are made to the parent topics (e.g. “Microbial Community Profiling” as opposed to “KneadData” specifically) they don’t always alert us. Tagging @sagunmaharjann for visibility of this potential issue.

Topic		Replies	Views
Interpreting read_count_table Numbers? KneadData	0	340	February 9, 2023
Deseq2 analysis of Humann3 outputs - clarification HUMAnN	3	98	October 30, 2024
Massive difference between paired reads' counts KneadData	1	631	May 1, 2021
Higher number of reads after trimmed + contaminated step cf. raw reads? KneadData	1	554	July 10, 2020
Questions about the read count table pulled from kneaddata logs KneadData	1	557	February 8, 2023

[wmgx workflow] Discrepancy in total read counts between kneaddata and humann

Related topics