First, thank you for developing the MaAsLin3 tool! We are planning to use MaAsLin3 to perform a longitudinal microbiome analysis of species abundance changes during disease progression using linear mixed-effects models.
Our dataset includes: ~90 subjects with 3 time points and ~140 subjects with 2 time points. Most first follow-ups occur around 2 years after baseline, while some are closer to 4 years.
I had a few questions:
1. Is it generally acceptable to include subjects with 2 time points and subjects with 3 time points in the same model, or would it be preferable to restrict the analysis to subjects with 3 time points despite the smaller sample size?
2. Given that around twice as many subjects have only 2 time points, would you recommend using only random intercepts, or both random intercepts and random slopes? We were also considering an alternative approach where subject-specific slopes are first estimated using linear mixed models and then tested in downstream linear regressions. Would this make sense with this type of dataset, or would a standard mixed-effects framework be preferable?
3. Are there any concerns regarding the variability in follow-up timing between subjects (e.g. first follow-up at ~2 years vs. ~4 years), even though time is included directly in the model?
1. It depends on what you’re trying to test and what the characteristics of your 2 populations are. If the 140 subjects with 2 time points are essentially the same population as the first 2 time points of the 90 subjects with 3 time points, merging them should be fine. If they represent a different population or treatment group, merging the two groups could be problematic unless an additional variable is included in the model to account for the difference.
2. I’d use random intercepts per subject. Unless you think you have strong per-subject time trends, random slopes are usually unnecessary and make the model quite a bit more difficult to fit. I’m not sure I follow the proposed subject-specific slopes approach, but it seems like this would be nearly the same as a random slopes model.
3. Whether this is a concern depends on whether there’s reason to believe something different is happening in the 2-year vs. 4-year cases. If you’re only encoding follow-up as “original” vs. “subsequent” and lumping 2 and 4 years together, differences between the 2-year and 4-year groups could be an issue. If you’re encoding follow-up as “original” vs. “2 years” vs. “4 years”, you should be safe from differences between the groups.
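To make the two follow-up encodings concrete, here is a minimal Python sketch. The function names, the example intervals, and the nearest-bin rule are all hypothetical choices for illustration; they are not part of MaAsLin3, and the right cutoffs would depend on your cohort.

```python
# Hypothetical helpers contrasting the two follow-up encodings.
# Nothing here is a MaAsLin3 function; it only prepares a metadata column.

def lumped_label(years_since_baseline):
    """Two-level encoding: every follow-up is treated the same."""
    return "original" if years_since_baseline == 0 else "subsequent"


def binned_label(years_since_baseline, bins=(2, 4)):
    """Three-level encoding: assign each follow-up to the nearest nominal bin."""
    if years_since_baseline == 0:
        return "original"
    nearest = min(bins, key=lambda b: abs(years_since_baseline - b))
    return f"{nearest} years"


# Example follow-up intervals in years (made-up values).
for years in (0, 2.1, 3.8):
    print(years, lumped_label(years), binned_label(years))
```

With the lumped encoding, the 2.1-year and 3.8-year visits share one level, so any systematic difference between them ends up in the residual. The binned encoding gives each its own level.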
Thank you very much for your quick reply, it is very helpful!
For the first point, the subjects are essentially from the same population/cohort. The difference is mainly in data availability: some have 2 time points, while fewer have 3 time points. My question was therefore more about whether linear mixed-effects models can handle this type of unbalanced longitudinal structure, in particular when most subjects have only 2 time points.
We were considering the subject-specific slope approach because we wondered whether estimating individual slopes, and then testing their association in a downstream linear regression, would be a reasonable way to directly model “change–change” relationships.
The mixed effects model should be fine with only 2 time points per subject (and it’s also fine if you have some with 2 and some with 3).
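For reference, a random-intercept model of this kind can be written as below. This is a generic textbook form rather than MaAsLin3's exact internal specification, and the symbols are introduced here only for illustration: $y_{ij}$ is the (transformed) abundance of one feature for subject $i$ at visit $j$, $t_{ij}$ is time, and $n_i$ is the number of visits for that subject.

```latex
y_{ij} = \beta_0 + \beta_1 t_{ij} + b_i + \varepsilon_{ij},
\qquad b_i \sim \mathcal{N}(0, \sigma_b^2),
\qquad \varepsilon_{ij} \sim \mathcal{N}(0, \sigma^2),
\qquad j = 1, \dots, n_i, \quad n_i \in \{2, 3\}
```

Nothing in this form requires $n_i$ to be the same for every subject, which is why an unbalanced design is not a problem. A random-slopes model would add a term $b_{1i}\, t_{ij}$, which is hard to estimate when most subjects have only two observations.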
I think I still don’t follow what the goal is with the subject-specific slope. It sounds like you might be trying to associate the subject-specific slope with some other metadatum (i.e. did the people in one group change more than the people in another group). However, if that’s what you’re after, I’d recommend using an interaction effect. If you can explain what kind of effect you’re looking at, I might be able to give more specific recommendations.
Thank you again for your response. For point 2, we were thinking of estimating individual subject-specific slopes for microbial abundance changes, and then assessing whether these are associated with subject-specific slopes of pathology accumulation. The goal was therefore to model longitudinal change–change relationships between microbiome dynamics and disease progression.
Got it. An interaction model (formula: ~ Time * Pathology) would answer the same question, so I think that would be the cleaner method. The Time coefficient will tell you, for the base level of the pathology outcome, how the microbiome differs between the time points. The interaction term will tell you, for the alternative pathology slope/outcome, how much the microbiome change between the time points differs from the change at the base pathology level (i.e. a difference in differences).
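Written out, the interaction model looks roughly like this. This is a sketch under assumptions: Time and Pathology are both coded 0/1, Pathology is a subject-level variable, and a per-subject random intercept $b_i$ is included as suggested earlier in the thread. The $\beta$ symbols are notation added here for illustration.

```latex
y_{ij} = \beta_0 + \beta_1\,\mathrm{Time}_{ij} + \beta_2\,\mathrm{Pathology}_{i}
       + \beta_3\,(\mathrm{Time}_{ij} \times \mathrm{Pathology}_{i}) + b_i + \varepsilon_{ij}

\text{change over time, base pathology level:} \quad \beta_1
\text{change over time, alternative level:} \quad \beta_1 + \beta_3
\text{difference in differences:} \quad (\beta_1 + \beta_3) - \beta_1 = \beta_3
```

So a test of $\beta_3 \neq 0$ is the test of whether the microbiome changes differently over time depending on pathology.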
I was hoping to get some more advice regarding a few additional questions on longitudinal analyses.
In our dataset, we have longitudinal measurements for both the microbiome and other variables of interest. These variables can be either pathology measures (continuous or binary) or protein levels (continuous variables). Our working hypothesis is that changes in the microbiota may influence proteins and pathology, which is why we initially considered using microbiota features as predictors and proteins/pathology as outcomes, but we recognize that the relationship probably operates in both directions.
I was therefore wondering:
If I use MaAsLin3 with microbiota features as the outcome, can the independent variable also be longitudinal? In other words, can both the predictor and the outcome have repeated measurements over time within the same model?
Is it possible to use MaAsLin3 with microbiota features as predictors and proteins/pathology as outcomes, all having longitudinal measurements? If so, how would you recommend specifying the model in this situation?
In most examples I have seen, the microbiota features appear to be treated as the outcome and clinical variables as predictors. Since our primary interest is understanding whether microbiota changes are associated with changes in proteins or pathology over time, I would be very interested in your thoughts on the most appropriate modeling strategy. Any additional comments, suggestions, or considerations regarding these models and how best to proceed would be extremely helpful.
Both the dependent and independent variables can have repeated measurements over time. You would just include the data points as usual and use a formula that has the metadatum and a random intercept per subject. This would tell you whether the metadatum is associated with the feature when they’re measured at the same time points, while still accounting for the fact that subjects were sampled repeatedly. If you wanted to consider some sort of time-lagged model (e.g., pathology affects the microbiome in the future), you would do the same thing, but instead of using the metadatum value for the same time point as the microbiome measurement, you would set up the data so that each microbiome measurement is paired with the metadatum from the time point before. That way your model becomes feature ~ metadatum_at_prior_timepoint. Running MaAsLin still works the same way though; it’s just a data encoding difference.
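As a rough illustration of that data encoding step, here is a minimal Python sketch that pairs each microbiome measurement with the metadatum from the previous visit. The record layout, the column names (`subject`, `visit`, `pathology`, `abundance`), and the helper `add_lagged_metadatum` are hypothetical. This only prepares the table and does not call MaAsLin.

```python
from collections import defaultdict

# Hypothetical long-format records: one row per subject per visit.
records = [
    {"subject": "S1", "visit": 1, "pathology": 0.2, "abundance": 0.10},
    {"subject": "S1", "visit": 2, "pathology": 0.5, "abundance": 0.07},
    {"subject": "S1", "visit": 3, "pathology": 0.9, "abundance": 0.03},
    {"subject": "S2", "visit": 1, "pathology": 0.1, "abundance": 0.12},
    {"subject": "S2", "visit": 2, "pathology": 0.1, "abundance": 0.11},
]


def add_lagged_metadatum(rows, key="pathology"):
    """Return rows where each visit carries the metadatum from the prior visit.

    Baseline visits are dropped because they have no prior time point.
    """
    by_subject = defaultdict(list)
    for row in rows:
        by_subject[row["subject"]].append(row)

    lagged = []
    for subject_rows in by_subject.values():
        ordered = sorted(subject_rows, key=lambda r: r["visit"])
        # Pair consecutive visits: (visit 1, visit 2), (visit 2, visit 3), ...
        for prev, curr in zip(ordered, ordered[1:]):
            new_row = dict(curr)
            new_row[f"{key}_prior"] = prev[key]
            lagged.append(new_row)
    return lagged


lagged_records = add_lagged_metadatum(records)
for row in lagged_records:
    print(row["subject"], row["visit"], row["abundance"], row["pathology_prior"])
```

One thing to keep in mind with this encoding: every subject loses their baseline row, so a subject with only 2 time points contributes a single row to the lagged model.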
Assessing features ~ metadata vs. metadata ~ features requires two different frameworks since the features are assumed to be high dimensional and the metadata are assumed to be low(ish) dimensional. In features ~ metadata, the main concern is multiple hypothesis testing since you get a p-value for each of the many features; this is what MaAsLin is about. In metadata ~ features, the main concern is overfitting/stability since this is a classic data science problem of predicting an outcome given a high dimensional input. This would traditionally be solved by things like LASSO or Ridge regression. In practice, you’ll usually get directionally similar results: if a metadatum is positively associated with a feature, the feature is positively associated with the metadatum. Strictly speaking, a linear modeling system like MaAsLin tells you “feature A is/isn’t positively/negatively associated with metadatum 1 when controlling for metadata 2, 3, etc.”, so most people are satisfied with this sort of statement whether they believe the true causal path is microbiome → pathology or pathology → microbiome.
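To illustrate the multiple-testing side of features ~ metadata, here is a small standalone Python sketch of the Benjamini-Hochberg procedure, a common way to adjust per-feature p-values. It is shown only to make the idea concrete; the function name and the example p-values are made up, and this is not a description of MaAsLin's internals.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values (q-values), in input order."""
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])

    q_values = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity of q-values.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        q_values[idx] = running_min
    return q_values


# One made-up p-value per feature for a single metadatum.
p_per_feature = [0.01, 0.04, 0.03, 0.005]
q_per_feature = benjamini_hochberg(p_per_feature)
print(q_per_feature)
```

With hundreds of species tested, a raw p-value threshold of 0.05 would let through many false positives, and this adjustment controls the expected fraction of them among the reported hits.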