Transformation or GLM + link function?

I am trying to create a model to determine whether categorical variables of interest (e.g., whether patients had surgery, disease assessment by physicians) are significant predictors of specific KOs found in our Humann3 output.

I know it’s recommended to use the total-sum scaled abundances as input into Maaslin3, then use a log transformation and an OLS regression. For some other analyses, I had transformed the KOs’ TSS abundances using the arcsine square root transformation. However, now that I am working on making a GLM, I am a bit stuck.

I am currently deciding between four options and would appreciate some insight:

  • log-transformed TSS abundances as input into OLS regression with categorical variables
  • arcsine square root-transformed TSS abundances as input into OLS regression with categorical variables
  • arcsine square root-transformed TSS abundances as input into GLM using the Gamma family (with a log link function)
  • log-transformed TSS abundances as input into GLM using the Gamma family (with a log link function)

After the arcsine square root transformation, some of my KOs of interest are not normally distributed, which is why I am a bit stuck on my methods here. Gamma seems like the most appropriate distribution to use because it can handle positive, continuous non-integer data.

If anyone has suggestions or references to share, that would be extremely helpful. Thank you so much!!

Hi,

Arcsine square root has historically been considered in differential abundance analysis because it’s the variance-stabilizing transformation for a binomial proportion. However, at least within our group (and across the literature from what I’ve seen), it has fallen out of favor because it’s difficult to interpret what regression coefficients mean on the arcsine square root scale. By contrast, coefficients in regressions on log-transformed data are very straightforward: with a log2 transformation, a one-unit change in the covariate corresponds to a 2^Beta multiplicative change in the relative abundance. Furthermore, microbiome relative abundances tend to be approximately log-normally distributed, so after log transforming the relative abundances you get approximately normal data, as you would hope to use in a linear regression.
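To make the coefficient interpretation concrete, here is a minimal Python sketch. The abundance values and the surgery / no-surgery grouping are made up for illustration, all values are assumed to be nonzero, and the hand-rolled `ols_slope_intercept` helper is a hypothetical stand-in for whatever regression tool you actually use. It only shows how a coefficient on the log2 scale maps back to a fold change.

```python
import math

# Hypothetical TSS relative abundances of one KO (illustrative values only).
no_surgery = [0.0010, 0.0020, 0.0040, 0.0020]
surgery = [0.0040, 0.0080, 0.0160, 0.0080]  # each value is 4x its counterpart


def ols_slope_intercept(x, y):
    """Closed-form simple OLS fit of y on a single covariate x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Binary covariate: 0 = no surgery, 1 = surgery.
x = [0] * len(no_surgery) + [1] * len(surgery)
# Response: log2 of the relative abundance (assumes no zeros).
y = [math.log2(v) for v in no_surgery + surgery]

beta, intercept = ols_slope_intercept(x, y)
fold_change = 2 ** beta  # back-transform the coefficient to a ratio

print(f"beta = {beta:.3f}, fold change = {fold_change:.3f}")
```

Because the surgery values were constructed as four times the no-surgery values, beta comes out at about 2 and the fold change at about 4. With a binary covariate, the OLS slope is simply the difference in mean log2 abundance between the two groups.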

Regarding OLS vs. Gamma family regression: once you’ve log transformed, the data will typically be approximately normally distributed and negative (since log2 of something between 0 and 1, like a relative abundance, is negative). The Gamma family requires strictly positive responses, so OLS rather than Gamma will probably be the way to go.
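The sign issue can be checked quickly. The sketch below uses made-up relative abundances (assumed to lie strictly between 0 and 1) and compares the two transformations from the question: the log2 values are all negative, while the arcsine square root values stay positive.

```python
import math

# Hypothetical relative abundances, strictly between 0 and 1.
rel_abund = [0.5, 0.01, 0.0003]

# log2 of a value in (0, 1) is always negative.
log2_vals = [math.log2(p) for p in rel_abund]

# arcsin(sqrt(p)) for p in (0, 1) lies in (0, pi/2), so it is positive.
asin_sqrt_vals = [math.asin(math.sqrt(p)) for p in rel_abund]

log2_all_negative = all(v < 0 for v in log2_vals)
asin_all_positive = all(v > 0 for v in asin_sqrt_vals)

print("log2 values:", [round(v, 3) for v in log2_vals])
print("arcsine sqrt values:", [round(v, 3) for v in asin_sqrt_vals])
print("log2 all negative:", log2_all_negative)
print("arcsine sqrt all positive:", asin_all_positive)
```

A response that is negative everywhere is outside the support of the Gamma distribution, which is the reason the log-transformed data pair naturally with OLS instead.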

Will


This is a very helpful explanation, thank you so much Will!