One-to-many problem with humann2_regroup_table

Greetings my beloved bioBakery team!

I noticed that the humann2_regroup_table script does not divide the gene abundance by the number of times it is regrouped to an ontology. Why is that? As you indicated elsewhere, allowing a gene to regroup to multiple ontologies creates a one-to-many problem. Consequently the components do not truly sum to 1 and violate a compositional analysis assumption - “each component (gene) should never give more than itself” (quoting Thom Quinn). TSS normalization of the regrouped table can’t fix this. It would be great if humann2 could include that option in the future. I created an R script that addresses the problem by dividing the gene abundance from the input gene table by the number of times the gene is regrouped to an ontology here.

Best regards,
-Ali

The golden rule for abundances as reported by HUMAnN, including those coming from regroup_table, is: the ratio of the abundance of feature A to the abundance of feature B in the output estimates the ratio of the coverage depth of A to the coverage depth of B in the underlying sample.

The one-to-many “problem” is a property of certain grouping relationships. Consider gene A with a single domain A1 and gene B with two domains, B1 and B2. If A and B have abundance (coverage depth) of 1 each, then all three domains should also have abundance of 1. However, if we were to split B’s abundance evenly over its two domains when regrouping it would suggest that B1 and B2 were less-well-covered than A1, which is false.
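The domain example above can be sketched in a few lines of base R. This is only an illustration: the names `gene_abund` and `gene_to_domain` are hypothetical and are not HUMAnN objects, and the code does not claim to reproduce how humann2_regroup_table works internally.

```r
# Hypothetical gene abundances (coverage depth) from the example above
gene_abund <- c(A = 1, B = 1)

# Hypothetical one-to-many map: gene A has domain A1, gene B has domains B1 and B2
gene_to_domain <- list(A = "A1", B = c("B1", "B2"))

# Full membership: every domain inherits the full abundance of its gene
full <- unlist(lapply(names(gene_abund), function(g) {
  d <- gene_to_domain[[g]]
  setNames(rep(gene_abund[[g]], length(d)), d)
}))

# Even split: each gene's abundance is divided by its number of domains
even_split <- unlist(lapply(names(gene_abund), function(g) {
  d <- gene_to_domain[[g]]
  setNames(rep(gene_abund[[g]] / length(d), length(d)), d)
}))

full        # A1 = 1, B1 = 1,   B2 = 1
even_split  # A1 = 1, B1 = 0.5, B2 = 0.5 (B1 and B2 look half as covered as A1)
```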

While the golden rule enables comparisons of features within sample, we still need to do something to correct for differences in sequencing depth when comparing between samples. Sum-normalization is one way to do this (by forcing all features to sum to 1.0 in each sample) while still obeying the golden rule. While some sum-normalized outputs obey other properties of compositional data (e.g. reads mapped to genes obey the “don’t give more than yourself” rule), this is not a guarantee for all grouping relationships (e.g. genes to domains, as in the example above, and reactions to pathways).

Excellent point. I believe you are right that splitting a gene evenly across each group it is assigned to can result in false conclusions. However, I would argue that treating groups as parts of a whole (e.g., scaling the groups by the total sum of the groups) will also lead to false conclusions, because groups that share a component will increase in abundance while the other groups will by necessity decrease. It seems that the answer lies in treating only the original components as parts of a whole (e.g., scaling the groups or components by the total sum of the components). After all, the additional copies of the components do not exist in reality - the sum of the parts cannot be greater than the whole.

Below I attempt to explain my reasoning in 2 parts. In part 1, I illustrate the false conclusions resulting from: (A) splitting components evenly between their assigned groups and (B) treating the additional components as part of a whole. (C) I provide a solution by treating only the components as part of a whole - in other words, always normalizing based on the component values.

In part 2 I demonstrate how (A) and (B) also result in false conclusions when comparing samples and how (C) is the best solution of the three. I include demonstrations using total sum scaling and additive log-ratio (alr) transformation.

From now on whenever I say abundance I mean relative abundance.

PART 1: Pizza

Consider the slices of a pizza in the figure below. We are interested to know what proportion of pizza slices have the following toppings: olives, pepperoni, and mushrooms. The pizza contains 5 slices. Based on the common understanding, there are 2 (40%) slices with only olives, 1 (20%) slice with only pepperoni, 1 (20%) slice with only mushrooms, and 1 (20%) slice with both olives and pepperoni. There are no missing slices.

Method A: Components are split evenly between assigned groups

If a component is included in N groups, then the component is split evenly across those N groups. Thus, the one slice of pizza with both toppings is cut in half and divided between the olive group and the pepperoni group. This makes it seem that there are fewer olive and pepperoni slices, while the proportion of mushrooms remains unchanged. Originally 60% of slices had olives; now only 50% do. This makes it appear that someone ate a proportion of the olive slices. Therefore, splitting changes the proportions of groups with shared membership while groups without shared membership remain unchanged. This method makes sense when we are not certain where to group a component. For example, if we are profiling the microbiome and cannot tell whether a particular 16S rRNA sequence is from E. coli or Shigella, and it is equally likely to be either and nothing else, then it is reasonable to split the count of this 16S sequence between the two. HUMAnN also uses a form of the splitting method in cases of uncertainty, when a read aligns equally well to multiple databases. The benefit of the split method is that the groups are still defined by the original closure constant, so normalization and transformations can be correctly applied to the group abundances. Despite this desirable compositional property, splitting will still make a group appear less or more abundant than it actually is.
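As a rough sketch of the split method on the pizza example, the five slices can be written as a slice-by-topping matrix in base R. The matrix layout and the variable names are my own assumptions for illustration.

```r
# One row per slice, one column per topping (1 = topping present)
slices <- matrix(
  c(1, 0, 0,   # olives only
    1, 0, 0,   # olives only
    0, 1, 0,   # pepperoni only
    0, 0, 1,   # mushrooms only
    1, 1, 0),  # olives and pepperoni (the shared slice)
  ncol = 3, byrow = TRUE,
  dimnames = list(NULL, c("olive", "pepperoni", "mushroom"))
)

# Share of slices carrying each topping (the shared slice counts for both)
colSums(slices) / nrow(slices)        # olive 0.6, pepperoni 0.4, mushroom 0.2

# Method A: divide each slice evenly among its toppings before summing
split_slices <- slices / rowSums(slices)
colSums(split_slices) / nrow(slices)  # olive 0.5, pepperoni 0.3, mushroom 0.2
```

The split proportions sum to 1, but the olive and pepperoni shares are lower than the share of slices that actually carry those toppings.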

Method B: treat additional components as part of a whole.

If a component is classified in N groups, then the component has full membership in all N groups. This increases the group library size: the more components map to more than one group, the more the library size grows and the larger the bias. The group abundances are then forced to sum to 1 so that each group becomes a part of a whole. In other words, the group abundances are scaled by the total sum of the group abundances (aka group library size). While in reality there are only 5 slices of pizza, the groupings suggest there are 6: 1 pizza + an artificial slice. This method is equivalent to forcing the 6 slices into one round pizza to make it “appear” like there is only one pizza. While there were originally 40% olive slices there are now 50% olive slices. That’s a 50% increase! The abundance of pepperoni slices remains the same. Because the groups were forced to a sum of 1, the extra proportion of olives and pepperoni must be compensated by decreasing the proportion of mushrooms, which drops by 10% (from 20% to 10%). The mushroom proportion decreased even though, in reality, the number of mushroom slices is the same before and after scaling the groups. Thus, normalizing the groups directly will bias the results if there is a one-to-many issue. Likewise, in the garden example of Part 2, there are in reality only 10 plant foods in Eric’s basket, while group-TSS imagines there are 2 extra plant foods and falsely normalizes by 12. Forcing the groups to a sum of one still doesn’t change the reality that groups with shared membership are not parts of a whole.

Method C: Treat only the original components as part of a whole

The difference between methods B and C is that method C requires only that the original components are parts of a whole; it does not require the group abundances to be parts of a whole, because they are not. Scaling (or the normalization or transformation of your choice) is applied to the original parts of the whole, not directly to the groups. For example, total sum scaling is applied to the components, which are then regrouped. The proportion of each slice is exactly the same before and after grouping. This method is not concerned that the group library size is larger than the component library size, because it only cares about the compositionality and normalization of the original components, independent of the groupings. So, normalization and transformations should be addressed at the level of the original components, not the groups directly.

PART 2: Comparing samples

To make this demo easier to follow, let us call scaling group counts by the total sum of the group counts “group-TSS” (method B) and scaling group counts by the total sum of the ungrouped component counts “component-TSS” (method C).

In this simple demo, we want to answer the question “Did Ali’s garden produce more plant foods than Eric’s garden?”.

Ali’s and Eric’s gardens grow 3 types of plant foods: fruits, vegetables, and nuts. A basket that can hold only 10 plant foods was used to randomly sample from each garden once. We do not know the absolute abundance of each plant food. Because the basket can only hold a defined amount of plant foods, it is said to have a ceiling and a sum constraint. Because this is a compositional problem, it may be more appropriate to rephrase the question as “Are the proportions of plant foods from Ali’s garden different from those from Eric’s garden?”

Grouping info:

Fruit: apple, orange, tomato
Vegi: broccoli, cabbage, tomato
Nut: almond, cashew

Let’s decide that a tomato is both a fruit and vegetable to create a one-to-many grouping problem (Apparently, scientifically speaking tomato is considered a fruit because it contains seeds). Don’t fact check me. :sweat_smile:

Count Data:

Table 1:
        apple  orange  tomato  broccoli  cabbage  almond  cashew
Ali     2      2       0       2         2        1       1       Σ=10
Eric    2      1       2       1         2        1       1       Σ=10

Table 2:
                 fruit  vegi  nut
Ali              4      4     2     Σ=10
Eric             5      5     2     Σ=12
Ratio Ali:Eric   0.8    0.8   1
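For readers who want to reproduce Table 2 from Table 1, here is a minimal base-R sketch. The membership-matrix representation and the variable names are my own choices for illustration.

```r
components <- c("apple", "orange", "tomato", "broccoli", "cabbage", "almond", "cashew")

# Table 1: component counts per garden
counts <- matrix(c(2, 2, 0, 2, 2, 1, 1,
                   2, 1, 2, 1, 2, 1, 1),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(c("Ali", "Eric"), components))

# Membership matrix: 1 if the component belongs to the group.
# Tomato is in both fruit and vegi, which creates the one-to-many grouping.
membership <- matrix(0, nrow = length(components), ncol = 3,
                     dimnames = list(components, c("fruit", "vegi", "nut")))
membership[c("apple", "orange", "tomato"), "fruit"] <- 1
membership[c("broccoli", "cabbage", "tomato"), "vegi"] <- 1
membership[c("almond", "cashew"), "nut"] <- 1

# Table 2: full-membership regrouping
groups <- counts %*% membership
groups

rowSums(counts)                     # Ali 10, Eric 10
rowSums(groups)                     # Ali 10, Eric 12 (Eric's tomatoes are counted twice)
groups["Ali", ] / groups["Eric", ]  # fruit 0.8, vegi 0.8, nut 1
```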

Table 3: Same data converted to proportions:

        apple  orange  tomato  broccoli  cabbage  almond  cashew
Ali     0.2    0.2     0       0.2       0.2      0.1     0.1     Σ=1
Eric    0.2    0.1     0.2     0.1       0.2      0.1     0.1     Σ=1

Table 4: Regroup components in fraction form:

                 fruit  vegi  nut
Ali              0.4    0.4   0.2   Σ=1
Eric             0.5    0.5   0.2   Σ=1.2
Ratio Ali:Eric   0.8    0.8   1

Ratios (same for counts and proportions):

Table 5:
        fruit:vegi  fruit:nut  vegi:nut
Ali     1           2          2
Eric    1           2.5        2.5

Truth: Ali has a smaller proportion of fruits and vegetables than Eric (20% less), while the proportion of nuts is the same. The ratio of Ali’s fruits or vegetables to Eric’s is 0.8 and the ratio of nuts is 1.

PART 2-A: TOTAL SUM SCALING

(1) Split method: split each component evenly by N groups.

Table 6:
                 fruit  vegi  nut
Ali              4      4     2     Σ=10
Eric             4      4     2     Σ=10
Ratio Ali:Eric   1      1     1

Table 7:
                 fruit:vegi  fruit:nut  vegi:nut
Ali              1           2          2
Eric             1           2          2
Ratio Ali:Eric   1           1          1

In this method the count of tomatoes was split between the fruit and vegetable groups (Table 6). The problem with this method is that the abundance of fruits and vegetables falsely appears to be identical in Ali’s garden and Eric’s garden, when Ali actually has smaller proportions of fruits and vegetables than Eric (4 vs 5, or 0.4 vs 0.5). The problem arose due to normalizing based on the grouping library size. Because a component was included in 2 groups its count was included twice in the grouping library size.

(2) Group-TSS: TSS normalize groups by group library size
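The split regrouping can be sketched in base R by dividing each row of a membership matrix by the number of groups the component belongs to. The matrix layout and names are assumptions made for this illustration.

```r
components <- c("apple", "orange", "tomato", "broccoli", "cabbage", "almond", "cashew")
counts <- matrix(c(2, 2, 0, 2, 2, 1, 1,
                   2, 1, 2, 1, 2, 1, 1),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(c("Ali", "Eric"), components))

# 1 if the component belongs to the group (tomato belongs to two groups)
membership <- matrix(c(1, 0, 0,
                       1, 0, 0,
                       1, 1, 0,
                       0, 1, 0,
                       0, 1, 0,
                       0, 0, 1,
                       0, 0, 1),
                     ncol = 3, byrow = TRUE,
                     dimnames = list(components, c("fruit", "vegi", "nut")))

# Split method: each component contributes 1/N to each of its N groups
split_membership <- membership / rowSums(membership)
split_groups <- counts %*% split_membership

split_groups                                    # both gardens: fruit 4, vegi 4, nut 2
rowSums(split_groups)                           # 10 and 10: the closure is preserved
split_groups["Ali", ] / split_groups["Eric", ]  # 1, 1, 1: the true 0.8 ratio is lost
```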

Table 8:
                 fruit         vegi          nut
Ali              0.4           0.4           0.2            Σ=1
Eric             0.4166666667  0.4166666667  0.1666666667   Σ=1
Ratio Ali:Eric   0.96          0.96          1.2

Table 9:
        fruit:vegi  fruit:nut  vegi:nut
Ali     1           2          2
Eric    1           2.5        2.5

Ali’s and Eric’s plant groups were each divided by the sum of the group total counts, 10 and 12 respectively (Table 8). Although the ratios between groups within each sample are correct (Table 9), the group-TSS normalization method makes Ali’s abundance of fruits and vegetables appear only slightly smaller than Eric’s (0.4 vs 0.41667, a ratio of 0.96 instead of the true 0.8) while making Ali’s abundance of nuts seem larger than Eric’s (0.2 vs 0.1667).
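The group-TSS numbers can be checked with a short base-R sketch that starts from the regrouped counts (4, 4, 2 and 5, 5, 2). Variable names are illustrative only.

```r
# Regrouped counts with full membership
groups <- matrix(c(4, 4, 2,
                   5, 5, 2),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(c("Ali", "Eric"), c("fruit", "vegi", "nut")))

# Group-TSS: divide each sample by its own group library size (10 and 12)
group_tss <- groups / rowSums(groups)

group_tss
rowSums(group_tss)                        # both samples sum to 1
group_tss["Ali", ] / group_tss["Eric", ]  # 0.96, 0.96, 1.2 instead of 0.8, 0.8, 1
```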

(3) Component-TSS: TSS normalize groups by component library size

Table 10:
                 fruit  vegi  nut
Ali              0.4    0.4   0.2   Σ=1
Eric             0.5    0.5   0.2   Σ=1.2
Ratio Ali:Eric   0.8    0.8   1

Table 11:
        fruit:vegi  fruit:nut  vegi:nut
Ali     1           2          2
Eric    1           2.5        2.5

In the component-TSS method, each group is divided by the sum of the original component counts. Alternatively, you can TSS normalize the original components and then group by plant food - the results are the same. Thus, Ali’s and Eric’s plant groups were both divided by 10 (Table 10). The conclusions based on the component-TSS approach are correct: the ratios are valid, Ali’s abundance of fruits and vegetables is smaller than Eric’s, and the abundance of nuts is the same. True, the group abundances can sum to more than 1, but we are not concerned about the group library size because the components have already been normalized by the component library size.
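A minimal base-R sketch of component-TSS, showing that the two orders of operations mentioned above give the same result. The membership matrix and variable names are assumptions for illustration.

```r
components <- c("apple", "orange", "tomato", "broccoli", "cabbage", "almond", "cashew")
counts <- matrix(c(2, 2, 0, 2, 2, 1, 1,
                   2, 1, 2, 1, 2, 1, 1),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(c("Ali", "Eric"), components))
membership <- matrix(c(1, 0, 0,
                       1, 0, 0,
                       1, 1, 0,
                       0, 1, 0,
                       0, 1, 0,
                       0, 0, 1,
                       0, 0, 1),
                     ncol = 3, byrow = TRUE,
                     dimnames = list(components, c("fruit", "vegi", "nut")))

# Order 1: regroup first, then divide by the component library size
regroup_then_scale <- (counts %*% membership) / rowSums(counts)

# Order 2: TSS normalize the components first, then regroup
scale_then_regroup <- (counts / rowSums(counts)) %*% membership

all.equal(regroup_then_scale, scale_then_regroup)           # TRUE
rowSums(regroup_then_scale)                                 # Ali 1.0, Eric 1.2
regroup_then_scale["Ali", ] / regroup_then_scale["Eric", ]  # 0.8, 0.8, 1
```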

PART 2-B: LOG-RATIO TRANSFORMATION

A similar problem arises if we want to transform the groups using a log-ratio transformation. I argue that when there is a one-to-many issue in regrouping components, the reference should be based on the components, not the group total abundances directly. To demonstrate this, I additive log-ratio (alr) transformed the group counts (or proportions) using the component library size as the reference. This gave correct conclusions: Ali’s fruits are 0.8 times as abundant as Eric’s (i.e., 20% less abundant), and nuts are equal in abundance (Table 12). Using the group library size as the reference in the alr transformation results in false conclusions: Ali’s fruits appear 0.96 times as abundant as Eric’s, and nuts are not equal in relative abundance (Table 13).

I didn’t get the right conclusions when I centered-log ratio (clr) transformed the group counts using the geometric mean of the component counts or the group counts. I’m not sure why clr doesn’t work. I’ll have to think about this more.

Group-log-transform: Use the group counts as a reference to transform groups

Table 12 (alr, component lib size as ref.):
                          fruit        vegi         nut
Ali                       -0.9162907   -0.9162907   -1.6094379
Eric                      -0.6931472   -0.6931472   -1.6094379
Difference (Ali – Eric)   -0.2231436   -0.2231436   0.0000000
Ratio (e^Difference)      0.8          0.8          1.0

Table 13 (alr, group lib size as ref.):
                          fruit        vegi         nut
Ali                       -0.9162907   -0.9162907   -1.6094379
Eric                      -0.8754687   -0.8754687   -1.7917595
Difference (Ali – Eric)   -0.04082199  -0.04082199  0.18232156
Ratio (e^Difference)      0.96         0.96         1.2

Note: When using the lib size as the reference for alr transformation, the transformed counts will all be negative because the counts are smaller than the reference. To have all positive values for the transformed counts, the reference must be smaller than every count. To avoid interpreting negative numbers, simply divide the lib size of each sample by the same constant, chosen large enough that the reference becomes smaller than every count. For example, in our example we would divide each lib size by 10. The conclusions are always identical as long as all the lib sizes are divided by the same number.

Because we are working with natural-log transformed counts, the ratio between Ali’s and Eric’s groups (how much more abundant Ali’s group is compared to Eric’s) is found by taking the difference between the alr transformed counts of the two samples and computing the exponential of that difference.

For example, using the component lib size as the reference (Table 12), Ali’s fruit alr transformed count is -0.9162907 and Eric’s is -0.6931472. The difference between them is -0.2231436. This means that Ali’s fruits are e^(-0.2231436) = 0.8 times as abundant as Eric’s fruits (20% less abundant).
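The same arithmetic can be followed in base R without any package. The helper `alr_ref` below is a hypothetical stand-in that simply takes log(count / reference); it is an assumption made for this sketch, not the alr() function used in the R code at the end of this post.

```r
# Hypothetical helper: log-ratio of each group count to a chosen reference value
alr_ref <- function(x, ref) log(x / ref)

y1 <- c(fruit = 4, vegi = 4, nut = 2)  # Ali's group counts
y2 <- c(fruit = 5, vegi = 5, nut = 2)  # Eric's group counts
comp_lib <- c(Ali = 10, Eric = 10)     # component library sizes

# Reference = component library size (Table 12)
diff_comp <- alr_ref(y1, comp_lib[["Ali"]]) - alr_ref(y2, comp_lib[["Eric"]])
exp(diff_comp)   # 0.8, 0.8, 1

# Reference = group library size (Table 13)
diff_group <- alr_ref(y1, sum(y1)) - alr_ref(y2, sum(y2))
exp(diff_group)  # 0.96, 0.96, 1.2

# Dividing both references by the same constant leaves the differences unchanged
diff_scaled <- alr_ref(y1, comp_lib[["Ali"]] / 10) - alr_ref(y2, comp_lib[["Eric"]] / 10)
all.equal(diff_comp, diff_scaled)  # TRUE
```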

Conclusion

Because the one-to-many problem artificially increases the library size, normalization by library size should be performed using the library size of the original components, not the group library size. Splitting the membership of a component should not be performed when we are certain about which groups (e.g., ontologies) the component belongs to.

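R code for alr transformation
library(compositions) # assumption: the alr() calls below (with the ivar argument) are taken to be compositions::alr; the package is not named in this post
x1 <- c(2,2,0,2,2,1,1) # Ali's garden component counts
x2 <- c(2,1,2,1,2,1,1) # Eric's garden component counts
y1 <- c(4,4,2) # Ali's garden group counts
y2 <- c(5,5,2) # Eric's garden group counts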

Table 12:
alr transform groups using the component lib size as the reference

alr(c(y1, sum(x1)),ivar=4) # transform Ali’s garden
# -0.9162907 -0.9162907 -1.6094379
alr(c(y2, sum(x2)),ivar=4) # transform Eric’s garden
# -0.6931472 -0.6931472 -1.6094379
alr(c(y1, sum(x1)),ivar=4) - alr(c(y2, sum(x2)),ivar=4) # Difference between Ali’s and Eric’s
# -0.2231436 -0.2231436 0.0000000 # The actual output was -2.231436e-01 -2.231436e-01 4.440892e-16. I reported the results in standard form.
exp( alr(c(y1, sum(x1)),ivar=4) - alr(c(y2, sum(x2)),ivar=4)) # exponential value of the difference
#0.8 0.8 1.0

alr transform groups using the component lib size divided by 10 as the reference, to avoid interpreting negative transformed counts.

alr(c(y1, sum(x1)/10),ivar=4)
# 1.3862944 1.3862944 0.6931472
alr(c(y2, sum(x2)/10),ivar=4)
# 1.6094379 1.6094379 0.6931472
alr(c(y1, sum(x1)/10),ivar=4) - alr(c(y2, sum(x2)/10),ivar=4)
# -0.2231436 -0.2231436 0.0000000
exp( alr(c(y1, sum(x1)/10),ivar=4) - alr(c(y2, sum(x2)/10),ivar=4))
# 0.8 0.8 1.0

Table 13:
alr transform groups using the group lib size as the reference

alr(c(y1, sum(y1)),ivar=4)
# -0.9162907 -0.9162907 -1.6094379
alr(c(y2, sum(y2)),ivar=4)
# -0.8754687 -0.8754687 -1.7917595
alr(c(y1, sum(y1)),ivar=4) - alr(c(y2, sum(y2)),ivar=4)
# -0.04082199 -0.04082199 0.18232156
exp( alr(c(y1, sum(y1)),ivar=4) - alr(c(y2, sum(y2)),ivar=4) )
# 0.96 0.96 1.20

Correction: That’s a 25% increase (.4 x 1.25 = .5), not 50% increase.

Regarding the log-ratio transformations of compositional data analysis (CoDA): if the additive log-ratio (alr) transformation was performed using the nuts group as the reference (denominator), the results would be correct as well. Library size doesn’t matter with this method because alr takes the ratio of each individual component to one chosen reference component, and therefore the library size cancels out. This avoids the whole one-to-many problem and can be applied directly to groups. However, the choice of reference does matter. If fruits or vegetables were used as the reference, the results would be incorrect. It might be a good idea not to choose a group with a one-to-many problem as the reference, and instead to choose a reference that is stable and ubiquitous across samples.
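To illustrate the point about the choice of reference, here is a base-R sketch using the garden group counts. The helper `alr_by_group` is hypothetical and written only for this example.

```r
groups <- matrix(c(4, 4, 2,
                   5, 5, 2),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(c("Ali", "Eric"), c("fruit", "vegi", "nut")))

# Hypothetical helper: alr with one group as the reference (denominator)
alr_by_group <- function(g, ref) {
  keep <- setdiff(colnames(g), ref)
  log(g[, keep, drop = FALSE] / g[, ref])
}

# Nuts as reference: nuts have no shared membership and are equal in both gardens
nut_ref <- alr_by_group(groups, "nut")
exp(nut_ref["Ali", ] - nut_ref["Eric", ])      # fruit 0.8, vegi 0.8 (matches the truth)

# Fruit as reference: fruit contains the double-counted tomatoes
fruit_ref <- alr_by_group(groups, "fruit")
exp(fruit_ref["Ali", ] - fruit_ref["Eric", ])  # vegi 1, nut 1.25 (does not match the truth)
```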

For users to see the update on this conversation:

Any new thoughts on this idea?

I applied a similar alr method in my published metagenomic paper and it looks like it worked well and may outperform the classical clr method. The metagenome function feature (e.g. MetaCyc pathway, GO term, etc.) I chose as the reference (denominator) was the feature with the lowest variance across samples after clr transformation, because I wanted to choose the function that was most stable. While I performed a simulation and compared this method with other methods (see the supplemental file), more work is needed to validate it.
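A rough base-R sketch of that reference-selection idea, on made-up numbers: the `abund` matrix, its sample names, and its feature names are invented for illustration, and this is not the code from the paper.

```r
# Made-up abundance table: rows are samples, columns are function features
abund <- matrix(c(1, 2, 4,
                  4, 2, 1,
                  2, 2, 2),
                nrow = 3, byrow = TRUE,
                dimnames = list(c("s1", "s2", "s3"), c("f1", "f2", "f3")))

# clr: log abundance minus the per-sample mean of the log abundances
clr <- log(abund) - rowMeans(log(abund))

# Pick the feature whose clr values vary least across samples
clr_var <- apply(clr, 2, var)
ref <- names(which.min(clr_var))
ref  # "f2" for this toy table

# alr of the remaining features against the chosen reference
alr_tab <- log(abund[, colnames(abund) != ref, drop = FALSE] / abund[, ref])
alr_tab
```

With real data, zeros would need to be handled before taking logs; the toy table avoids them on purpose.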

Thanks for the follow-up! I hadn’t forgotten about this discussion - we are planning to add this style of normalization as a new default in HUMAnN 4 (along with some other options that have been requested over the years). Stay tuned. :smiley:

Sounds great! I’m surprised that I don’t see any discussion on this one-to-many normalization issue when grouping genes in the literature. This calls for writing a paper on this issue :wink: