I have microbiota data from the same individuals in 2 timepoints. What is the correct formula to study the change / difference between timepoints?
There actually also are 3 experimental groups, but it might not be possible to use experimental groups AND individuals as factors in the same time, as we have limited n and the individuals of course belong to groups (?).
I would use ~ Indicator_of_time_point_2 + (1|subject_id) with small_random_effects=TRUE. The coefficient fit to Indicator_of_time_point_2 will give the average difference between time points, and (1|subject_id) will control for repeated sampling. If you want to evaluate the effect of each experimental group, you could add + experimental_group, but if the groups are just uninteresting subsets of individuals that don’t change per person (e.g., hospital or recruitment site), the subject_id random intercept will be enough.
I have a few questions regarding this topic of longitudinal analysis using MaAsLin3. I am working with a similar approach as proposed by maikael, but in this case, we have 4 groups with 2 time points (before and after treatment). I want to evaluate whether the intervention has produced changes in the gut microbiome between the groups.
Some individuals dropped out of the study (around 5%), so I have more individuals at one time point than the other.
I am considering using the following model with sex as a covariate:
formula = ‘~ Group*Timepoint + (1|ID_subject) + Sex’,
reference = c(‘Group,Control’)
normalization = ‘TSS’,
transform = ‘LOG’,
augment = TRUE,
standardize = TRUE,
max_significance = 0.1,
median_comparison_abundance = TRUE,
median_comparison_prevalence = FALSE,
max_pngs = 250
Do you think the model is ok? One of my doubts is whether the “Before” timepoint should be set as the reference as well…
When I try this model, I obtain this error:
“Error in d[setdiff(unique(metadata[, names(l)]), names(d))] ← NA: only 0’s can be mixed with negative subscripts”
On the other hand, if I specify the random effects like this:
formula = ‘~ Group*Timepoint + Sex’,
random_effects = c(‘ID_subject’),
the model works fine. I’m not sure whether this could be due to the differences in sample values between time points.
Finally, the results that I am getting are only referring to Group or Sex. My understanding is that if there were changes in the interaction, I should have received a file with “Group:Timepoint” results, correct?
The formula you’ve listed at the top (~ Group*Timepoint + (1|ID_subject) + Sex) is okay in principle, but I’m going to recommend you change it (see below).
If Timepoint is a categorical variable, it should be set as a factor beforehand (safest) or included in the reference.
I suspect the reason for the error is that the random effects are somehow invalid in the model given the metadata or you’re missing levels of the random effect (e.g., because of multicollinearity). Also, check that ID_subject is a factor or character vector and not a numeric vector (i.e., did you name your ID_subject 1, 2, 3…?). The reason the second model works fine is that when both a formula and some other effects are included, only the formula is used (so you’re just fitting without the problematic random effect). I just pushed a new MaAsLin 3 version that errors out when you try to include both so that people don’t make this mistake without realizing.
Yes - there should be a Group[non-baseline group]:Timepoint[non-baseline time] in your results unless your metadata are somehow collinear such that these interactions can never be fit.
With all that said, I’d suggest some changes to the model call in order to test what I think you want to test:
Do check that your ID_subject column has all the values you think it does and that they’re characters or factors.
Include the parameter small_random_effects=TRUE in your run. This is recommended to avoid issues with the logistic fit unless you have many many observations per ID_subject, especially in a complicated model like this.
Using small_random_effects=TRUE is probably going to newly introduce multicollinearity into your model if you have one group per subject (i.e., each subject is only treated or not treated). (This would’ve been a silent problem with the model before anyways.) As long as the two groups are exchangeable at baseline (e.g., you randomized treatment vs. control or you had a bunch of identical mice), I would recommend the formula ~ Group:Timepoint + Timepoint + (1|ID_subject). The Timepoint coefficient will be the change in features from Before to After in the Control group, and the Group:Timepoint will tell you how much the Before vs. After difference changes for the non-Control group relative to the Control group. That is, the total Before vs. After change for the non-Control group is Timepoint + Group:Timepoint. Including the full interaction term (with a *) won’t work because of multicollinearity, but Group:Timepoint should be the difference due to treatment that you care about.
Finally, if you control per-subject with ID_subject, you won’t need to (and shouldn’t) also include Sex because Sex is (presumably) fixed per ID_subject, so it is taken care of when controlling for ID_subject.
Thank you so much for your help Will, and for developing this tool! It is awe-inspiring!
I got that random effects work within the formula, and I have built the model following your suggestions.
If I understand well, the results will be referred to the control group (the one used as reference). So, if I want to assess the differences between the four groups (3 treatments and 1 control), should I change the reference for the group and make a different model for each? For example to see the differences between treatment 1 vs treatment 2 vs treatment 3
To clarify the interpretation of results, focusing on the formula suggested:
~ Group:Timepoint + Timepoint + (1|ID_subject)
reference = c(‘Group,control’, ‘Timepoint,Before’)
If I obtain significant taxa for:
a) Treatment1:After. Does this mean this treatment has produced a change in the taxa compared to the control?
b) Treatment2:Before. Does this mean the abundance of this taxa is different at baseline between treatment2 and the control?
c) After. Does this mean the abundance of this taxa is different after the intervention in the control group, or in general for all the groups?
Curiously, from my metagenomics results (nearly 15,000 taxa), many have low p values but after adjustment, the q-value is 1. With the large number, I understand many are false positives, but in the end, for the abundance category, only 4 taxa have a q-value less than 0.1. I am starting to delve into gut microbiome analysis and I do not know if this is normal or if we might be too strict with the correction.
Another question is regarding the graphs for the features that are significantly different. Since I have two-time points, they are represented in the graph. For example, if I have 50 individuals in the control group with two time points, the graph shows 100 points for the control, because it does not differentiate between time points. Is there a way to modify to obtain the representation by time point?
Sorry whether the questions are quite basic, but I am new to this type of analysis and trying to learn as much as possible!
Re: 1: Your groups will only be compared against the control group by default. If you want to compare them against each other, you can use the contrast test (see this thread for a similar case). However, unless your model takes a long time to fit, it’s probably easier to just change the reference and run the model again.
Re: 2: Treatment1:After1 indeed means the treatment has produced a change in the taxa compared to the control, and it is the difference in the treated group and control group at time After1. Treatment2:Before (I think) shouldn’t exist in your results because Before should be taken as the reference Timepoint. If this is showing up, can you try setting Timepoint as a factor explicitly? Maybe the reference is getting tripped up by the interaction. After gives the difference after the intervention in the control group only.
Re: 3: As corrections go, BH is pretty lenient, so I don’t think it’s an issue with the correction. Having only a few significant hits isn’t that unusual either and it makes for a clearer story in an analysis than “a whole bunch of things changed.” However, 15,000 taxa is a ton (way more than I usually see except in enormous meta-analyses), and if most of those are very rare, they’re probably driving up your q-values because you don’t have enough power to detect a significant difference in the rare ones. How did you arrive at 15,000 taxa? I wouldn’t normally recommend filtering in MaAsLin 3, but you might consider setting min_prevalence=0.01 and min_abundance=0.0001 or something like that.
Re: 4: I’m not sure I follow - can you post the plot so I can see what you mean? The MaAsLin 3 plots are typically intended for diagnostic purposes, so it might be better to just recreate the plot in ggplot directly.