Hi Maaslin users and developers, I’m looking for suggestions on how to treat the confounders in my metadata. The goal is to associate the rumen microbiome with methane emission while also taking animal lactation stage, herd location and sequencing batch into account.
Currently I have set lactation stage (3 levels: Early, Intermediate and Late) and methane emission as fixed effects, with “Early” as the reference level for lactation stage. For the random effects, I have sequencing batch. But I talked with my professor and he disagreed with the idea of setting “herd location” as a random effect.
According to my statistical test, herd location was significantly associated with methane emission. Of course, if it’s possible, we would also be interested in associating herd location with the rumen microbiome. However, I’m not convinced that herd location should be modeled as a fixed effect, because our dataset clearly does not contain all possible herd locations. So I think it’s more suitable to model it as a random effect.
As a beginner, I would be grateful if anyone could share some insights with me. When should I model a variable as a random effect (even when I am in fact interested in its association with the microbiome)?
I suggest you manually inspect the processed data in output/features/filtered_data_norm_transformed.tsv and try plotting a species of interest (especially if there’s a bug where you already know there should be an effect). Overlay your sequencing batch / herd / stage metadata as color (or another plot aesthetic) and visually assess how those factors relate to the species abundance.
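As a numeric companion to the plot, here is a minimal Python sketch that summarizes one species' abundance per metadata group. The sample IDs, abundance values and batch labels are made-up toy data, and `summarize_by_group` is a hypothetical helper, not part of Maaslin2. In practice the values would come from the processed TSV and your metadata table.

```python
from collections import defaultdict
from statistics import mean


def summarize_by_group(abundance, groups):
    """Return {group: (n_samples, mean_abundance)} for a single feature.

    abundance: dict mapping sample ID -> transformed abundance of one species
    groups:    dict mapping sample ID -> metadata level (batch, herd, stage, ...)
    Samples without a metadata entry are skipped.
    """
    buckets = defaultdict(list)
    for sample, value in abundance.items():
        if sample in groups:
            buckets[groups[sample]].append(value)
    return {g: (len(v), mean(v)) for g, v in sorted(buckets.items())}


# Toy data (assumed, for illustration only).
abundance = {"S1": 0.10, "S2": 0.12, "S3": 0.30, "S4": 0.34}
batch = {"S1": "batch1", "S2": "batch1", "S3": "batch2", "S4": "batch2"}

summary = summarize_by_group(abundance, batch)
for group, (n, avg) in summary.items():
    print(f"{group}: n={n}, mean={avg:.3f}")
```

If the group means differ a lot for a factor such as sequencing batch, that is a hint the factor deserves a place in the model.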
Nested random factors are outside the set of simple random effects models that Maaslin2 can handle. Maaslin2 can only string together multiple random intercepts, each with its own grouping variable, e.g. <fixed_effects> + (1 | g1) + (1 | g2). The lme4 vignette (especially table 2, rows 3 and 4) and this thread on CrossValidated can help explain some of these terms if you aren’t familiar with them.
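To make the formula shape above concrete, here is a small Python sketch that assembles an lme4-style formula string from lists of fixed and random effects. This is only an illustration of the structure, not Maaslin2's own code. `build_formula` and the variable names (`abundance`, `methane`, `lactation_stage`, `sequencing_batch`, `herd_location`) are assumptions chosen to match the question.

```python
def build_formula(response, fixed_effects, random_effects):
    """Build 'response ~ fixed terms + (1 | g1) + (1 | g2) ...'.

    Each random effect contributes one independent random intercept;
    nesting or random slopes are deliberately not supported here.
    """
    terms = list(fixed_effects) + [f"(1 | {g})" for g in random_effects]
    return f"{response} ~ " + " + ".join(terms)


formula = build_formula(
    "abundance",
    fixed_effects=["methane", "lactation_stage"],
    random_effects=["sequencing_batch", "herd_location"],
)
print(formula)
# abundance ~ methane + lactation_stage + (1 | sequencing_batch) + (1 | herd_location)
```

A nested structure would need a different term in lme4 syntax, such as (1 | g1/g2), which is exactly the kind of term this simple concatenation cannot express.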
I know nothing about ruminants, but grouping structure of related sets of observations is a classic use case for random effects. The main advantage of random effects is partial pooling: it gives your model “memory”. After your model has seen herds 1 through N, it should have a reasonable idea of the range herd N+1 should fall into; it doesn’t need to (and possibly can’t) learn it independently of the other herds. The Statistical Rethinking lecture linked below (and really the whole series) gives a good overview of multilevel modeling if you would like more background.
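Partial pooling can be illustrated with a toy Python sketch: each herd's mean is pulled toward the overall mean, and herds with few animals are pulled harder. This is a simplification with loud assumptions. The within-herd variance `sigma2` and between-herd variance `tau2` are fixed by hand here, whereas a real mixed model (lme4, and therefore Maaslin2) estimates them from the data. The herd labels and values are invented.

```python
from statistics import mean


def partial_pool(herd_values, sigma2, tau2):
    """Shrink each herd's mean toward the grand mean.

    sigma2: assumed within-herd variance (hand-picked, not estimated)
    tau2:   assumed between-herd variance (hand-picked, not estimated)
    Returns (grand_mean, {herd: shrunken_mean}).
    """
    all_values = [v for vals in herd_values.values() for v in vals]
    grand_mean = mean(all_values)
    pooled = {}
    for herd, vals in herd_values.items():
        n = len(vals)
        # Weight on the herd's own mean grows with its sample size.
        weight = (n / sigma2) / (n / sigma2 + 1 / tau2)
        pooled[herd] = weight * mean(vals) + (1 - weight) * grand_mean
    return grand_mean, pooled


# Toy data (assumed): herd C has a single, extreme observation.
herds = {
    "A": [10.0, 12.0, 11.0, 13.0, 9.0, 11.0, 10.0, 12.0],
    "B": [14.0, 15.0, 16.0, 15.0, 14.0, 16.0],
    "C": [20.0],
}
grand_mean, pooled = partial_pool(herds, sigma2=4.0, tau2=4.0)
for herd, vals in herds.items():
    print(f"{herd}: raw mean={mean(vals):.2f}, pooled={pooled[herd]:.2f}")
```

Herd C, with one observation, moves most of the way toward the grand mean, while herd A, with eight, barely moves. That is the sense in which the model borrows information from the other herds.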