I couldn’t find this explicitly stated in the documentation:
I presume that, for pathway abundances, genes with identified functions can be assigned to more than one pathway, so the quantification is somewhat overlapping?
How would you take this into account when analysing the results statistically? (I’d love to correlate differences in pathway abundance with a continuous metadata variable I have for the samples.)
To analyse gene family abundance, I used limma-trend on logCPM values calculated from the RPK values. Is this suitable, or would a different method be preferable?
All help is much appreciated.
Bumping this thread for attention
Apologies for missing this question before the bump.
You’re correct about genes potentially contributing to multiple reactions which themselves can contribute to multiple pathways. We typically don’t take any special steps to compensate for these phenomena, instead just sum-normalizing pathways to relative abundance (or CPM) units and then performing metadata associations with MaAsLin 2 or other microbiome-appropriate methods. Sum-normalizing has the effect of adjusting for sequencing depth, but also allows you to focus on relative coverage of pathways within and between samples.
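To make the sum-normalization step concrete, here is a minimal Python sketch. It is not HUMAnN's own code: the pathway IDs, the RPK values, and the `sum_normalize` helper are all made up for illustration, and on real tables you would normally use a utility such as HUMAnN's `humann_renorm_table` script instead.

```python
def sum_normalize(abundances, total=1_000_000.0):
    """Scale one sample's values so they sum to `total` (1e6 gives CPM units)."""
    sample_sum = sum(abundances.values())
    if sample_sum <= 0:
        raise ValueError("sample has no signal to normalize")
    return {name: value * total / sample_sum for name, value in abundances.items()}


# Hypothetical pathway abundances (RPK) for two samples with the same
# composition; sample_b was simply sequenced ten times deeper.
sample_a = {"PWY-1": 30.0, "PWY-2": 10.0, "PWY-3": 60.0}
sample_b = {"PWY-1": 300.0, "PWY-2": 100.0, "PWY-3": 600.0}

cpm_a = sum_normalize(sample_a)
cpm_b = sum_normalize(sample_b)

for pathway in sample_a:
    print(pathway, round(cpm_a[pathway]), round(cpm_b[pathway]))
```

After normalization the two samples carry the same values for every pathway, which is the depth adjustment described above. The normalized tables are what you would then pass to your association method.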
An alternative approach would be to sum-normalize your genes to CPM units and then run them BACK through HUMAnN to directly compute pathway abundance in CPM units. This approach is a little bit “cleaner” in that it doesn’t count any reads twice when adjusting for compositionality, and may also better handle data sets with large differences in the fraction of read mass assigned to pathways.
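The difference between the two orders of operation can be seen in a toy Python example. Everything here is an assumption made for illustration: the gene names, the gene-to-pathway map, and especially the rule that a pathway's abundance is the plain sum of its member genes, which is a deliberate simplification and not how HUMAnN actually computes pathway abundance.

```python
# Hypothetical mapping: geneB contributes to two pathways, geneC to none.
GENE_TO_PATHWAYS = {
    "geneA": ["PWY-1"],
    "geneB": ["PWY-1", "PWY-2"],
    "geneC": [],
}


def to_cpm(values):
    """Sum-normalize one sample so its values add up to one million."""
    total = sum(values.values())
    return {name: value * 1_000_000.0 / total for name, value in values.items()}


def pathway_sums(gene_values):
    """Toy pathway abundance: add up member genes (NOT HUMAnN's algorithm)."""
    totals = {}
    for gene, value in gene_values.items():
        for pathway in GENE_TO_PATHWAYS.get(gene, []):
            totals[pathway] = totals.get(pathway, 0.0) + value
    return totals


# Same pathway-associated genes, but sample_2 has far more unassigned mass.
sample_1 = {"geneA": 40.0, "geneB": 40.0, "geneC": 20.0}
sample_2 = {"geneA": 40.0, "geneB": 40.0, "geneC": 320.0}

# Former approach: compute pathways first, then normalize the pathway table.
pathways_then_cpm = [to_cpm(pathway_sums(s)) for s in (sample_1, sample_2)]

# Alternative approach: normalize genes to CPM first, then compute pathways.
cpm_then_pathways = [pathway_sums(to_cpm(s)) for s in (sample_1, sample_2)]

print(pathways_then_cpm)
print(cpm_then_pathways)
```

Under these toy assumptions, normalizing the pathway table gives the two samples identical values, and geneB is counted twice in the normalizing total. Normalizing the genes first counts each gene once in that total, and the pathway values for `sample_2` come out four times lower because most of its mass is not assigned to any pathway.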
Both approaches are valid, and we typically use the former for simplicity. As long as you describe your choice clearly (and are aware of the potential pros and cons), you should be in good shape.