Hello,
I have recently been working with the humann3 outputs (through biobakery) and would like to use deseq for analysis on the the differential expression of the pathways.
While I imagine there are several ways to do this here is what I have done:
- renormalized the humann3 pathabundance.csv to be in CPM
- Converted the CPMs to a rough counts matrix by multiplying all CPMs by the “total reads” in their library found in humann_read_and_species_count_table.tsv
- Analyzed this count table in deseq2.
Is this a valid approach? Does humann3 do anything to drastically change the data?
Thanks!
1 Like
This procedure should work, but it might raise eyebrows in a paper in that the value of working with counts is mainly wrapped up in working with raw counts (i.e. not performing any sort of upstream normalization on the raw data, and treating 0 counts vs 1 count vs 2 counts as meaningful). One of the steps HUMAnN performs internally is normalizing mapped reads vs. gene length to report RPK units. My understanding is that methods like DEseq opt to directly model the fact that longer genes will have higher counts than shorter genes, so I’m not sure how it will perform with that information already normalized out.
1 Like
Thank you for sharing this approach! I’m interested in using the counts matrix for additional analysis as well. However, I encountered an issue when attempting to replicate your pipeline: the file humann_read_and_species_count_table.tsv
was not found. Could you please guide me on where to locate this file?
If you are using biobakery, it should be here: ${output_option}/humann/counts/humann_read_and_species_count_table.tsv
I have more recently switched to just using humann3, not through biobakery, so I use the overall size of the library.