Selecting covariates - MaAsLin2 DA

Hello,

I am running differential abundance analysis with MaAsLin2 and wondering how to approach the inclusion/exclusion of co-variates in the linear regression model.

I have 5 variables: total ammonia nitrogen (TAN), copper, nickel, lead and chromium. However, redundancy analysis (RDA) on Hellinger-transformed relative abundances found that TAN was the only environmental variable that significantly explained the variance in community composition. The other variables played a negligible role in explaining the composition.

Therefore, in the differential abundance analysis with MaAsLin2, I have included only TAN as a fixed effect in the linear model to predict the log2 fold changes in transformed relative abundances of the taxa. I have not included the other variables as co-variates because, based on the RDA, they did not help explain the overall community structure. Also, because this is part of a factorial experiment, the orthogonal design should deal with confounding between the variables.

Running MaAsLin2 with only TAN as the predictor shows 3 taxa that differ significantly between the factor levels. However, when all the main effects are included, the significant differences for TAN disappear. Is this because adding non-significant terms to the model reduces the degrees of freedom?

Is using TAN as the only predictor in the differential abundance analysis a reasonable approach? I am cautious that this carries an implicit assumption: because TAN is the main driver of the overall community structure (global), it will likely also explain any significant differences in individual taxa (local).

I would really appreciate any thoughts or tips on this.

Many thanks,

Sam


Hi Sam,

When using all the variables in the model, you’ll get ~5x the number of associations, which means 5x as many tests going into the false discovery rate (FDR) correction. If the associations in the TAN-only model were already on the weak side, their significance will probably disappear once 5x as many associations go into the FDR step. Even for the TAN-only model, I’m wary of selecting TAN first based on a global analysis and then taking the MaAsLin p-values/q-values as legitimate, since TAN has already been pre-selected from the same data as the most significant variable.
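To make the FDR point concrete, here is a minimal sketch in plain Python. It assumes a Benjamini-Hochberg correction applied over all associations at once; the p-values, the number of taxa, and the 0.1 threshold are all made up for illustration and are not taken from Sam's data or from MaAsLin's defaults.

```python
def benjamini_hochberg(pvalues):
    """Return BH-adjusted q-values in the same order as the input p-values."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    qvalues = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so the q-values stay monotone.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        qvalues[idx] = running_min
    return qvalues


THRESHOLD = 0.1  # arbitrary illustrative cutoff, not a claim about any tool's default

# Hypothetical p-values for 20 taxa tested against TAN alone:
# three fairly weak hits and 17 clear non-hits.
tan_pvalues = [0.002, 0.004, 0.006] + [0.5] * 17

# The same TAN p-values pooled with 80 unremarkable tests
# (20 taxa x 4 additional variables) before correction.
pooled_pvalues = tan_pvalues + [0.5] * 80

q_tan_only = benjamini_hochberg(tan_pvalues)
q_pooled = benjamini_hochberg(pooled_pvalues)

hits_tan_only = sum(q <= THRESHOLD for q in q_tan_only[:3])
hits_pooled = sum(q <= THRESHOLD for q in q_pooled[:3])

print("TAN-only q-values:", [round(q, 3) for q in q_tan_only[:3]], "hits:", hits_tan_only)
print("Pooled q-values:  ", [round(q, 3) for q in q_pooled[:3]], "hits:", hits_pooled)
```

In this toy setup the three TAN p-values are identical in both runs; only the number of tests they are corrected against changes, and that alone is enough to push them past the threshold.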

I’d run 2 models:

  1. I’d run MaAsLin 3 repeatedly with a formula consisting of one variable at a time so you can see what the marginal associations look like. Especially if you think the variables are confounding each other, this would be a good point to decide what you think is driving any associations and what’s along for the ride.
  2. I’d also run MaAsLin 3 with a formula consisting of all 5 variables at the same time to see what the association is with each variable when controlling for the rest.

In the best case, (2) will give some significant associations and you can interpret those as the association of abundance/prevalence with the particular variable, controlling for the rest. If only (1) gives anything significant, I’d check whether you have previous data or whether there’s previous data in the literature you can analyze to make a principled claim for why you care about one variable in particular and then focus on that. Notably, I would not take the results of (1) and draw conclusions based on those nominal p/q-values, since they’re not FDR corrected across the 5 different models run.

If you have 50+ samples, I’d use the default LOG + TSS in MaAsLin 3 to get abundance and prevalence associations, and if you have fewer than 50 samples, I’d use PLOG + TSS to try to get as much power as possible without splitting abundance from prevalence associations.
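As a rough illustration of the difference between the two options, the sketch below applies total sum scaling (TSS) to one made-up sample and then transforms it two ways. It is plain Python, not MaAsLin 3 code. The function names are hypothetical, and both the handling of zeros and the choice of pseudo-count (half the smallest non-zero value in the sample) are assumptions for illustration; check the MaAsLin 3 documentation for what LOG and PLOG actually do.

```python
import math


def tss(counts):
    """Total sum scaling: divide each feature by the sample total."""
    total = sum(counts)
    return [c / total for c in counts]


def log_abundance(rel_abund):
    """log2 of non-zero values; zeros become None, to be handled by a separate prevalence model."""
    return [math.log2(x) if x > 0 else None for x in rel_abund]


def pseudo_log(rel_abund, pseudo_count):
    """log2 after replacing zeros with a pseudo-count, so every value stays in one model."""
    return [math.log2(x if x > 0 else pseudo_count) for x in rel_abund]


sample = [50, 0, 30, 20]  # hypothetical counts for four taxa in one sample
rel = tss(sample)

# Option A: split the signal into presence/absence plus log abundance of non-zeros.
prevalence = [int(x > 0) for x in rel]
log_vals = log_abundance(rel)

# Option B: keep a single abundance value per taxon by filling zeros (assumed pseudo-count).
half_min = min(x for x in rel if x > 0) / 2
plog_vals = pseudo_log(rel, half_min)

print("relative abundance:", rel)
print("prevalence:", prevalence)
print("log2 (zeros missing):", log_vals)
print("pseudo-log2:", plog_vals)
```

The first option yields two responses per taxon (present or absent, and how abundant when present), while the second keeps one response per taxon, which is the trade-off described above for smaller sample sizes.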

Will
