I have recently been working with the humann3 outputs (through biobakery) and would like to use deseq for analysis on the the differential expression of the pathways.
While I imagine there are several ways to do this here is what I have done:
- renormalized the humann3 pathabundance.csv to be in CPM
- Converted the CPMs to a rough counts matrix by multiplying all CPMs by the “total reads” in their library found in humann_read_and_species_count_table.tsv
- Analyzed this count table in deseq2.
Is this a valid approach? Does humann3 do anything to drastically change the data?
This procedure should work, but it might raise eyebrows in a paper in that the value of working with counts is mainly wrapped up in working with raw counts (i.e. not performing any sort of upstream normalization on the raw data, and treating 0 counts vs 1 count vs 2 counts as meaningful). One of the steps HUMAnN performs internally is normalizing mapped reads vs. gene length to report RPK units. My understanding is that methods like DEseq opt to directly model the fact that longer genes will have higher counts than shorter genes, so I’m not sure how it will perform with that information already normalized out.