Confounding factors

Hi,
I want to associate bacterial abundance with a disease state and some of the clinical data. The disease has several confounders (such as age and BMI). I’ve read in several papers that MaAsLin treats confounding effects, while others set the confounding effects as fixed effects. Can you suggest the right way to control for confounding effects?

Another thing: is it preferable to use absolute counts or relative abundance (16S)? Do you recommend rarefying the sequences first?

Many thanks!

MaAsLin2 will handle potential confounders as additional covariates. If the covariate is categorical (e.g. recruitment site), then there’s an option to model the covariate as a fixed effect or a random effect. There’s some benefit to using a random effect if the covariate has >5 levels and many theoretical levels that are unsampled (for example, in the case of something like “recruitment site,” there are likely many potential sites that were NOT sampled, and so the variance explained by recruitment site will tend to be underestimated if it is treated as a fixed effect).

We recommend converting counts to relative abundance without rarefaction before running MaAsLin2.
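
A minimal sketch of that conversion (total-sum scaling, with no rarefaction step), written in Python purely for illustration; MaAsLin2 itself is an R package, and the count table and function name here are made up. As far as I know, MaAsLin2's `normalization` option defaults to TSS, so it can also do this step for you.

```python
def to_relative_abundance(counts):
    """Convert one sample's raw counts to proportions that sum to 1."""
    total = sum(counts.values())
    if total == 0:
        # An empty sample has no defined composition; keep zeros.
        return {taxon: 0.0 for taxon in counts}
    return {taxon: n / total for taxon, n in counts.items()}


# Hypothetical 16S count table: sample -> {taxon: read count}.
# The two samples have different depths; no reads are discarded.
count_table = {
    "sample_A": {"Bacteroides": 600, "Prevotella": 300, "Roseburia": 100},
    "sample_B": {"Bacteroides": 50, "Prevotella": 0, "Roseburia": 150},
}

relative_table = {s: to_relative_abundance(c) for s, c in count_table.items()}
for sample, profile in relative_table.items():
    print(sample, profile)
```

Every sample ends up on the same scale regardless of its sequencing depth, which is the point of normalizing instead of subsampling reads.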

Thanks!
So the advantage of setting a confounder as a fixed effect is that I’ll be able to see the association of each feature with the confounder, but MaAsLin2 will handle it either way?

One last question: is it possible to feed MaAsLin2 collapsed taxonomy (e.g. summing all the taxa of a genus into an additional feature)?

Hi there -
I’m not sure I understood your follow-up question. For MaAsLin2 you can set confounding covariates as either fixed or random effects. Both options can take care of confounding: usually a fixed effect is enough, but when you have a categorical confounder with many levels, you might want to consider a random effect (as Eric suggested).
MaAsLin2 is insensitive to taxonomic level, so you can collapse species-level features to, say, genus level. You can also manually collapse the species of one specific genus while keeping the rest unchanged (I think this is what you were suggesting?). In that case I’d exclude the species-level abundances from that genus in my input, keeping only the genus-level abundance - it wouldn’t make sense to count the genus “twice” when calculating per-taxon relative abundances.
Thanks!
Siyuan
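
A small sketch of the manual collapse described above, in Python for illustration only. It assumes feature names of the form "Genus species", which is a made-up convention for this example, and the function name is hypothetical. The species rows of the chosen genus are removed and replaced by a single genus-level row, so nothing is counted twice.

```python
def collapse_genus(profile, genus):
    """Replace every '<genus> <species>' feature with one genus-level feature."""
    collapsed = {}
    genus_total = 0.0
    for feature, abundance in profile.items():
        if feature.split(" ")[0] == genus:
            genus_total += abundance        # fold the species into the genus sum
        else:
            collapsed[feature] = abundance  # leave all other features unchanged
    collapsed[genus] = genus_total
    return collapsed


# Made-up relative abundances for one sample.
profile = {
    "Bacteroides fragilis": 0.25,
    "Bacteroides ovatus": 0.25,
    "Prevotella copri": 0.5,
}

collapsed = collapse_genus(profile, "Bacteroides")
print(collapsed)
```

Because the species rows are dropped, the sample's relative abundances still sum to 1 after collapsing.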

Thanks Siyuan,
I believe the main thing still confusing me is that when I run MaAsLin2 it returns results only for the fixed effects (and I’m still not sure why). For example, in the MaAsLin2 demo these are the fixed effects: ‘diagnosis’, ‘dysbiosisnonIBD’, ‘dysbiosisUC’, ‘dysbiosisCD’, ‘antibiotics’, ‘age’, but when I remove one of them it no longer appears as a metadata variable in the all_results.tsv file.

Back to our case: I’m studying a disease for which age and BMI are confounders. I want to associate the microbiome with the disease while controlling for the confounders.
BMI, for example, can be treated as categorical with 4 levels (underweight, normal, etc.) or as a continuous variable.

So if I understand correctly, treating BMI as fixed means:
for each sample i=1…n and BMI level j=1,2,3,4, the model estimates the disease state.

This seems better than setting BMI as a random effect and adding a random intercept for each sample.

So,

  1. Why do only fixed effects return results in the HMP2 demo? Does it mean that I need to set my disease effect as a fixed effect?
  2. What do you think is preferable in the BMI example: categorical fixed or continuous random?
  3. We might have some batch effects. Do you recommend assigning the different sequencing runs as a random effect?
  4. My disease is binary (yes/no). Is there any recommendation for associations in such a case (maybe a specific analysis method)? Does it influence any of the above?

Many thanks for your help…

Hi there,

In response to your questions:
Why do only fixed effects return results in the HMP2 demo? Does it mean that I need to set my disease effect as a fixed effect?
Fixed effects are the associations you are primarily interested in, hence MaAsLin2 reports them back (see the next answer for more on fixed vs. random effects). In this case I’d set disease as a fixed effect.

What do you think is preferable in the BMI example: categorical fixed or continuous random?
For BMI I would run both versions as a fixed effect; MaAsLin associates the bugs with whichever of the fixed effects in your model they are most associated with. Things we normally set as random effects are technical confounders such as sequencing batch, differences in kits, etc.; for longitudinal models, the individual identifier is always set as a random effect. I would run separate models for the different BMI types if you want to look at it as both continuous and categorical.

We might have some batch effects. Do you recommend assigning the different sequencing runs as a random effect?
Yes, see above. That is completely justified as a random effect. If there is a wide range of sequencing depths, you can also use sequencing depth as a fixed/normal covariate.

My disease is binary (yes/no). Is there any recommendation for associations in such a case (maybe a specific analysis method)? Does it influence any of the above?
Nope, that should be fine as a normal covariate in the MaAsLin model.

I hope this helps, please let us know if you still want some more details or any additional help we can give!
Best,
Kelsey
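
One way to write down the setup described in the answers above, as a sketch rather than MaAsLin2's exact specification: a separate model is fitted for each feature, with the (transformed) abundance as the response, the disease and confounders as fixed effects, and sequencing batch as a random intercept. All symbols here are my own notation, not taken from the MaAsLin2 documentation.

```latex
% y_{if}: transformed abundance of feature f in sample i
% b(i):   sequencing batch of sample i (random intercept)
y_{if} = \beta_{0f}
       + \beta_{1f}\,\mathrm{disease}_i
       + \beta_{2f}\,\mathrm{age}_i
       + \beta_{3f}\,\mathrm{BMI}_i
       + u_{f,\,b(i)} + \varepsilon_{if},
\qquad
u_{f,b} \sim \mathcal{N}(0, \sigma_{u,f}^{2}),
\quad
\varepsilon_{if} \sim \mathcal{N}(0, \sigma_{f}^{2})
```

The reported results correspond to the fixed-effect coefficients (here the three slopes for each feature f), which is why only fixed effects show up in the output; the random intercept absorbs batch-level variation without getting a coefficient of its own.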

Thanks @Kelsey_Thompson!
This was very helpful!
I somehow thought that MaAsLin2 takes one fixed effect at a time as the dependent variable and uses the bugs as covariates. Anyway, now things are much clearer.

After several runs of MaAsLin I found that in many cases the most significant associations are those with very few non-zero data points (sometimes even 1 out of hundreds).
Do you have any idea why this happens? It seems to have no biological meaning, and I’m considering filtering out features with few non-zeros to reduce the number of comparisons.

Thanks!

Hi!

Are you filtering for prevalence at all? That could help with the outlier issue you are describing.

Best,
Kelsey
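
A rough sketch of what prevalence filtering means, in Python for illustration only, with made-up data and a hypothetical function name. MaAsLin2 has its own `min_prevalence` (and `min_abundance`) options for this, so in practice you would normally set those rather than filter by hand.

```python
def filter_by_prevalence(table, min_prevalence=0.1):
    """Keep features that are non-zero in at least min_prevalence of samples."""
    n_samples = len(table)
    features = {f for profile in table.values() for f in profile}
    kept = set()
    for feature in features:
        nonzero = sum(1 for profile in table.values() if profile.get(feature, 0) > 0)
        if nonzero / n_samples >= min_prevalence:
            kept.add(feature)
    return {
        sample: {f: v for f, v in profile.items() if f in kept}
        for sample, profile in table.items()
    }


# Made-up table: rare_taxon is non-zero in only 1 of 4 samples.
table = {
    "s1": {"common_taxon": 0.9, "rare_taxon": 0.1},
    "s2": {"common_taxon": 1.0, "rare_taxon": 0.0},
    "s3": {"common_taxon": 1.0, "rare_taxon": 0.0},
    "s4": {"common_taxon": 1.0, "rare_taxon": 0.0},
}

filtered = filter_by_prevalence(table, min_prevalence=0.5)
print(filtered["s1"])
```

Features that are present in only a handful of samples never reach the model, which both removes the single-point "associations" and reduces the number of tests being corrected for.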

Hello,
After I finish the MaAsLin2 analysis with a taxonomy file and a metadata file (including batch), how can I get an adjusted taxonomy file with reduced batch effects? Thanks very much.

Hi there,
Thanks for the question! I’m not exactly clear on what you mean by the adjusted taxonomy file, but we just released a new tool called MMUPHin that is meant to reduce batch effects during a meta-analysis. That might be what you are looking for with this type of analysis.

https://www.bioconductor.org/packages/release/bioc/vignettes/MMUPHin/inst/doc/MMUPHin.html

Best,
Kelsey

Hello,
I have more than one confounding factor. I am just wondering how to blend my 3 covariates (1 categorical, 2 continuous) into the model. From the user manual under interactions, it seems like the model can handle one confounder at a time, and that confounder has to be categorical?
Thank you!

You might only be able to add one random effect to the model (which must be categorical), but you can add other confounders/covariates (categorical or continuous) that are not handled as random effects.

The random effect is reserved for a “high impact” categorical variable that likely has other labels/levels that you didn’t measure. For example, subject/environment is a common random effect in studies with repeated measures of the same underlying communities (since you can’t exhaustively sample all subjects/environments).

Just to add to @franzosa’s comment, MaAsLin 2 does allow the simple situation of multiple random effects for the intercepts, which, as @franzosa mentioned, need to be categorical and, in addition, crossed or independent (nested random effects are currently not supported).

Thank you both for your answers. Sorry for getting back to this so late. What I wanted to look at is, for example, the association of my metal exposure with the microbiome, adjusting for gestational age, income, and breastfeeding. I am confused by the results. It seems like the model “fixed_effects = c('Cd_c', 'Brestfeed', 'Gestational_age', 'Family_income_Y6')” takes all the variables and screens them separately, so in the output I have results both for the variable I am interested in (Cd) and for the covariates I wanted to control for (breastfeeding, gestational age, and family income). Does this mean the covariates are controlled for, or is it a separate screening of each variable against each taxon?
Thank you so much!
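
On the "controlled or screened separately" question: as far as I understand MaAsLin2, all `fixed_effects` go into one multivariable model per feature, so each reported coefficient is adjusted for the other variables in the model, even though every variable gets its own rows in the output. The toy sketch below (Python, made-up numbers, not MaAsLin2 code) shows the difference between a crude and an adjusted coefficient. It uses residualization, which for a linear model gives the same exposure coefficient as fitting exposure and the confounder together.

```python
def mean(values):
    return sum(values) / len(values)


def fit_line(x, y):
    """Ordinary least squares for y = intercept + slope * x."""
    mx, my = mean(x), mean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, my - slope * mx


def residuals(x, y):
    """What is left of y after removing its linear dependence on x."""
    slope, intercept = fit_line(x, y)
    return [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]


# Made-up data: abundance depends only on age, but exposure is correlated with age.
age = [20, 30, 40, 50, 60, 70]
exposure = [3, 2, 5, 4, 7, 6]
abundance = [10, 15, 20, 25, 30, 35]  # exactly 0.5 * age

# Crude: exposure alone looks strongly associated with abundance.
crude_slope, _ = fit_line(exposure, abundance)

# Adjusted: remove the age signal from both variables, then regress.
adjusted_slope, _ = fit_line(residuals(age, exposure), residuals(age, abundance))

print("crude:", crude_slope)
print("adjusted for age:", adjusted_slope)
```

In this toy case the crude exposure effect is entirely due to age and vanishes after adjustment; the rows reported for the covariates themselves are simply their own adjusted coefficients from the same model.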

Thank you himel.mallick! I actually just found out that the MMUPHin package fits my aim better, because it has a covariate function that takes covariates specifically, and it also uses MaAsLin2. Thanks to Kelsey_Thompson for mentioning it in the previous reply.