Assuming the pathways were constructed from common shotgun metagenomic procedures, do pathways generated from HUMAnN2 maintain the compositional nature of the data? By “compositional,” I mean that the abundance of any 1 pathway is only interpretable relative to another and there is a sum constraint. Gene abundances are compositional, but I am not sure about the pathways. Knowing whether the pathway abundances are compositional or not will determine what transformations/normalizations and statistical methods I could use.

If the pathways have the one-to-many problem (1 gene/read maps to multiple pathways in a community), then the compositional nature of the data is violated and compositional data analysis methods cannot be applied. Furthermore, if there is a one-to-many problem, I don’t see how it’s valid to normalize the pathway abundances using a method that treats the pathway abundances themselves as the reference (e.g., TSS normalization); instead, one should use a reference that is compositional in nature (for example, if using TSS, use the total sum of the gene abundances, or perhaps even better, the total sum of the number of organisms as a reference). The normalization I’m talking about is meant to remove technical bias related to different library sizes.

From reading the HUMAnN2 paper, though, it seems that the one-to-many problem is solved. It might be important to note that I plan to assess pathways as a community as a whole. I am not a statistician and my thought process could be absolutely wrong. Please advise.
The short answer here is that we routinely sum-normalize pathway abundance data to reduce the influence of sequencing depth.
Note also that metagenomic measurements are always compositional/relative in the absence of some independent measure of biomass. However, the composition is usually not over the sequencing reads themselves (as in “75% of reads are derived from feature A and 25% from feature B”), which is how I read your “maintain the compositional nature” phrasing? Gene abundances, for example, are proportional to the fold coverage of genes, and not relative recruitment of reads (longer genes would otherwise appear to be artificially more abundant). Similarly, pathway abundances are proportional to the number of complete copies of the pathway (approximated as the coverage of the least-abundant required reaction).
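To make that concrete, here is a minimal Python sketch (not HUMAnN's actual algorithm; the reaction names and numbers are invented for illustration) of pathway abundance approximated as the coverage of the least-abundant required reaction:

```python
# Per-reaction abundances in coverage-like units (e.g. RPK), not read fractions.
# Values and reaction names are hypothetical.
reaction_abundance = {
    "RXN-A": 12.0,
    "RXN-B": 9.5,
    "RXN-C": 30.0,  # a reaction shared with other pathways can be reused
                    # without "using up" reads from this pathway
}

required_reactions = ["RXN-A", "RXN-B", "RXN-C"]

# Pathway abundance ~ number of complete pathway copies supported by the
# least-abundant required step
pathway_abundance = min(reaction_abundance[r] for r in required_reactions)
print(pathway_abundance)  # 9.5
```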
You are correct that some gene mass (and by extension read mass) can contribute to multiple pathway measurements. This would be strange if pathway abundances were simply a “repackaging” of gene or read abundances (in the way that one might repackage species abundances into genus abundances). However, as outlined above, this is not the case.
Thank you for responding even though it's the weekend, much appreciated! That's dedication.
In that case, is there still a need to normalize the pathway abundance table produced by HUMAnN2 in order to remove technical bias related to different sequencing depths? I see that folks from bioBakery recommend normalizing using TSS.
By “compositional,” we mean that the abundance of any 1 nucleotide fragment is only interpretable relative to another. This property emerges from the sequencer itself; the sequencer, by design, can only sequence a fixed number of nucleotide fragments. Consequently, the final number of fragments sequenced is constrained to an arbitrary limit so that doubling the input material does not double the total number of counts. This constraint also means that an increase in the presence of any 1 nucleotide fragment necessarily decreases the observed abundance of all other transcripts.
Reference
Quinn TP, Erb I, Gloor G, Notredame C, Richardson MF, Crowley TM. A field guide for the compositional analysis of any-omics data. GigaScience. 2019;8(9):giz107. doi:10.1093/gigascience/giz107
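As a toy illustration of that sum constraint (all feature names and counts below are hypothetical), a fixed sequencing budget means that increasing one feature's true abundance lowers the observed counts of everything else:

```python
# The sequencer returns a fixed number of fragments, so observed counts are
# effectively proportions of that budget.
true_copies = {"A": 100, "B": 100, "C": 100}
total_reads = 3000  # fixed sequencing budget

def observed_counts(copies, budget):
    total = sum(copies.values())
    return {k: budget * v / total for k, v in copies.items()}

print(observed_counts(true_copies, total_reads))
# {'A': 1000.0, 'B': 1000.0, 'C': 1000.0}

# Doubling the true abundance of A does not double the total read count;
# it lowers the observed counts of B and C instead.
true_copies["A"] *= 2
print(observed_counts(true_copies, total_reads))
# {'A': 1500.0, 'B': 750.0, 'C': 750.0}
```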
The default outputs from HUMAnN are not adjusted for sequencing depth. They are adjusted based on the lengths of database sequences that contribute to gene/pathway abundance to give RPK units, since this form of normalization is harder to perform post hoc. The RPK units (which are proportional to fold coverage) can then be adjusted for sequencing depth as you see fit. The renorm_table script can perform a few different flavors of TSS normalization on the RPK units, and that’s our usual next step in the analysis workflow.
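As a rough sketch of those two stages (this is not the renorm_table code itself; the gene names, lengths, and read counts are made up):

```python
gene_length_bp = {"geneA": 900, "geneB": 3000}
mapped_reads   = {"geneA": 45,  "geneB": 150}

# Stage 1 (done by HUMAnN): length-adjust to RPK units (reads per kilobase),
# which are proportional to fold coverage.
rpk = {g: mapped_reads[g] / (gene_length_bp[g] / 1000) for g in mapped_reads}

# Stage 2 (post hoc, e.g. via renorm_table): sum-normalize RPK within the sample.
total_rpk = sum(rpk.values())
relab = {g: v / total_rpk for g, v in rpk.items()}        # relative abundance
cpm   = {g: 1e6 * v / total_rpk for g, v in rpk.items()}  # copies per million

print(rpk, relab, cpm, sep="\n")
```

The renorm_table script's unit options (relative abundance vs. copies-per-million style units) follow the same idea; its `--help` output lists the exact flags.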
Thanks for pointing out the Quinn 2019 paper. Your quote from their intro is a more generalized and elegant version of my “metagenomic measurements are always compositional/relative in the absence of some independent measure of biomass” (but we are on the same page). Other, non-TSS approaches to working with compositional data laid out in the paper (e.g. CLR normalization) would also work on HUMAnN's RPK units. I would be cautious, though, about any techniques that assume integer read counts, as HUMAnN's RPK values would likely violate those assumptions.
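If it helps, here is a minimal sketch of applying a CLR transform to RPK values; the pseudocount handling for zeros is just one common heuristic, not something prescribed by HUMAnN, and the pathway names and values are invented.

```python
import math

# Hypothetical per-pathway RPK values for one sample
rpk = {"PWY-1": 9.5, "PWY-2": 0.0, "PWY-3": 30.0}

# CLR needs strictly positive values; shift by a small pseudocount
# (here, half the smallest non-zero value -- an assumption, not a recommendation)
pseudocount = min(v for v in rpk.values() if v > 0) / 2
shifted = {k: v + pseudocount for k, v in rpk.items()}

# CLR: log of each value minus the mean log (i.e. log ratio to the geometric mean)
log_vals = {k: math.log(v) for k, v in shifted.items()}
mean_log = sum(log_vals.values()) / len(log_vals)
clr = {k: lv - mean_log for k, lv in log_vals.items()}

print(clr)
```

Because CLR is scale-invariant, it gives the same result whether you feed it raw RPK or TSS-normalized RPK, which is part of why it pairs naturally with compositional analysis methods.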