Pathway abundance and cross-sample comparison

I have some HUMAnN2 results and I’m still not sure (1) how the abundance is calculated and (2) whether abundance values can be compared between samples with different sequencing depths.

From the HUMAnN2 publication, I read: “Per-gene alignment statistics are weighted based on alignment quality, coverage, and sequence length to yield gene abundance values.”

Does this mean that gene abundance is calculated as the mean per-position depth along the gene, with some thresholds on read alignment quality?

Then, from the GitHub page: “This file details the abundance of each pathway in the community as a function of the abundances of the pathway’s component reactions, with each reaction’s abundance computed as the sum over abundances of genes catalyzing the reaction.” This is further developed with: “In greater detail, the abundance for each pathway is a recursive computation of abundances of sub-pathways with paths resolved to abundances based on the relationships and abundances of the reactions contained in each. Each path, the smallest portion of a pathway or sub-pathway which can’t be broken down into sub-pathways, has an abundance that is the max or harmonic mean of the reaction abundances depending on the relationships of these reactions. Optional reactions are only added to the overall abundance if their abundance is greater than the harmonic mean of the required reactions.”
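To make the quoted description more concrete, here is a minimal Python sketch of one non-recursive path. It is not HUMAnN2's actual code. The reaction names and gene abundances are made up, and the mapping of relationships to operators (harmonic mean when every reaction is required, max when reactions are alternatives) is my assumption about what "depending on the relationships" means. The rule for optional reactions is left out.

```python
# Illustrative sketch only: not HUMAnN2's actual implementation.

def reaction_abundance(gene_abundances):
    """Reaction abundance = sum of the abundances of the genes catalyzing it."""
    return sum(gene_abundances)


def harmonic_mean(values):
    """Harmonic mean; returns 0.0 if any value is 0 (i.e. a step is missing)."""
    values = list(values)
    if not values:
        raise ValueError("need at least one value")
    if any(v == 0 for v in values):
        return 0.0
    return len(values) / sum(1.0 / v for v in values)


# Hypothetical gene abundances (in RPK) for three made-up reactions.
genes_by_reaction = {
    "R1": [4.0, 6.0],
    "R2": [5.0],
    "R3": [2.0, 8.0],
}

# Each reaction's abundance is the sum over its genes.
reactions = {name: reaction_abundance(g) for name, g in genes_by_reaction.items()}

# Assumption: reactions that are all required are combined with the harmonic mean.
required_path = harmonic_mean(reactions.values())

# Assumption: reactions that are alternatives to each other are combined with max.
alternative_path = max(reactions["R1"], reactions["R2"])

print(reactions)
print(required_path)
print(alternative_path)
```

With these numbers the reaction abundances are 10, 5 and 10, and the harmonic mean comes out at about 7.5, below the arithmetic mean of about 8.3. The harmonic mean is pulled toward the least abundant reaction, and it drops to 0 as soon as one required reaction is absent.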

All of this makes the abundance value of a pathway difficult to understand. Does a pathway abundance value of 5 mean that you most likely have 5 complete copies of this pathway in your dataset? If so, does this mean that the abundance of a gene or pathway scales with the sequencing depth of the sample, so that raw values should not be compared across samples?

Regards,
jsgounot

Sorry for the late reply. A pathway abundance of “5” means that the “weakest link” in the pathway was a gene/reaction with coverage of 5 RPK (reads per kilobase) units. Within a sample, these units are proportional to the number of complete copies of the pathway, but the actual number of complete copies is a difficult quantity to measure (it would depend on the number of cells in the original biosample). To compare between samples you’d want to normalize these RPK units to adjust for differences in sequencing depth, e.g. by sum-normalizing them, which means dividing each value by the sample’s total.
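As a rough illustration of sum-normalization, here is a minimal Python sketch. The pathway names and RPK values are made up, and `sum_normalize` is a hypothetical helper, not a HUMAnN2 function.

```python
def sum_normalize(abundances):
    """Divide every RPK value by the sample total so the values sum to 1."""
    total = sum(abundances.values())
    if total == 0:
        raise ValueError("sample has zero total abundance")
    return {name: value / total for name, value in abundances.items()}


# Hypothetical samples: B was sequenced twice as deeply as A,
# but the community composition is the same.
sample_a = {"PWY-1": 5.0, "PWY-2": 15.0}
sample_b = {"PWY-1": 10.0, "PWY-2": 30.0}

relab_a = sum_normalize(sample_a)
relab_b = sum_normalize(sample_b)

print(relab_a)
print(relab_b)
```

The raw RPK values differ by a factor of two between the samples, but after normalization both give the same relative abundances, so the depth difference no longer drives the comparison. In practice, HUMAnN2 ships a `humann2_renorm_table` utility script for renormalizing its output tables, which is probably preferable to doing this by hand.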