Deseq2 analysis of Humann3 outputs - clarification

Hello,

I want a clarification of this topic:

I only use Humann3 so I don’t have the humann_read_and_species_count_table.tsv file from the Biobakery workflow.

To get the raw counts for the use of DESEQ2, I need to remove the length normalisation of the RPK.
To do this, I need to get the total reads as described in the quoted topic above.
From KneadData before Humann I can get the number of reads here:

04/10/2024 02:46:07 AM - kneaddata.utilities - INFO: READ COUNT: final pair1 : Total reads after merging results from multiple databases ( out_paired_1.fastq ): 18018168.0

According to the biobakery script, the script retrieves the number of reads in the humann.log :
04/10/2024 03:22:43 AM - humann.utilities - DEBUG: b’36036336 reads; of these:

I don’t know if I really need to use 36036336, I think it’s better to use 18018168, maybe it won’t change much.

36036336 is the number of reads in the out_paired_1 file multiplied by two because it’s in paired end (there’s also an out_paired_2.fastq file with the same number of reads).

For example on this file in CPM:

Pathway humann_Abundance-CPM

UNMAPPED 804106
UNINTEGRATED 174839
UNINTEGRATED|unclassified 168038
UNINTEGRATED|g__Prevotella.s__Prevotella_sp_tc2_28 3679.22
PWY-1042: glycolysis IV 2030.87
GLUCONEO-PWY: gluconeogenesis I 814.428

For GLUCONEO-PWY: gluconeogenesis I do:
814.428 / 1 000 000 * 18018168 = 14674.5005279

=> 14674.5005279
So my count is a decimal, is that OK?
Do you think I should round this count ?

Cheers,
Jérémy Tournayre

We did not create DESeq2, so I can’t speak to how it would handle pseudo-count data provided as a floating point number versus an integer. I would assume they convert to integers via truncating/rounding at some point, but that’s just a guess!

1 Like

Thank you very much!! :slight_smile:

Just to be sure not to make any mistakes, to remove RPK and get the count, if I do this cross product :

CPM * total reads / 1 000 000, it seems ok, no?

I also wonder if I should do total reads or total reads/2 when the data are paired. Maybe it doesn’t change the results in the statistical analysis, I don’t know.

Cheers,
Jérémy Tournayre

That procedure will give you fractions of your total reads as counts. They will only be approximations of the true count however, which should be something more like “the number of reads mapped to genes that were associated with the pathway.” Note that the true count would be higher for pathways with larger genes, or involving more reactions, which are confounders that we typically want to correct for before downstream stats, though approaches like DESeq2 try to learn this information from the data table itself.