I wanted to ask whether MaAsLin2 currently supports using taxa as explanatory (independent) variables in a logistic regression. This would address a significant need within the research community studying the relationship between microbiota and diseases.
Many researchers want to assess the hypothesis that “the microbiome impacts disease X.” To test this hypothesis, the taxa should serve as the independent variable, with disease status (or clinical outcome) as the dependent variable (the taxon-covariate model). Integrating logistic regression would be particularly beneficial, given that outcomes in these studies are often binomial (e.g., disease vs. control).
In contrast, the covariate-taxon model, where each taxon serves as the outcome variable, is suitable when you want to assess how disease status or another covariate affects the abundance of each taxon, which is not what most researchers are interested in.
Is this something that MaAsLin2 can currently handle, or is there potential for incorporating this feature in future updates?
Thanks for checking out MaAsLin2 and thinking so deeply about the modeling decisions that were made. We have also thought about this deeply in the lab, and in an earlier revision we offered the ability to model the taxon as the independent variable.
However, we believe that using taxa as the outcome is sufficient in most cases for the types of questions researchers are asking (i.e., does this microbe change in abundance under phenotype Y?). As such, we have opted to use this convention for maintainability and to keep things a bit more straightforward.
As an aside, flipping the equation will generally give you similar results (and in many cases probably similar conclusions), but of course the results will not be exactly the same, as it does change the error function that is being minimized.
From my mentor Prof. John Petkau (Professor Emeritus in the Department of Statistics at the University of British Columbia):
You are absolutely correct that the most appropriate way to formulate the model depends upon the question being asked.
If the question is the effect of disease status on taxa abundance(s), then the abundance(s) should be used as response and disease status should be used as predictor. If the question is the effect of taxa abundance(s) on disease status, then disease status should be used as the response and the abundance(s) should be used as predictors.
In my view, this is a fundamental principle so important even in the linear model case … as long as one is thinking of assessing effects. Of course, if only interested in correlations, then no (linear) model is required … can simply evaluate the correlations of interest directly.
Nice hearing from you. The last time we met was at the IMPACTT conference.
I agree in general that simplicity is best. See the comments from my mentor John Petkau in bold:
"As an aside flipping the equation will generally give you similar results (and in many cases probably similar conclusions) but of course the results will not be exactly the same as it does change the error function that is being minimized."
I don’t follow this comment … it only seems to have meaning for the very simple case where both X and Y are univariate and continuous … the simple linear regression context?
Of course, in that context, the correlation does not depend on what you call X and what you call Y. But, wrt prediction, regressing Y on X and regressing X on Y and then thinking of the two prediction equations as if they are simply y = a + bx and x = c + dy (that is, ignoring that it is really y_hat and x_hat on the left-hand sides of those equations) leads to two different lines in (x,y) space.
I wouldn’t think of those results as “similar” so I’m unclear what the comment intends.
With any study, the first issue should be: What question is this study trying to address? The analysis approach should reflect the question to be addressed.
Cheers, John
In any case, flipping the equation gives me significantly different results. Since my study is case-control, flipping it turns the linear regression model into a logistic regression model, and the two approaches yield very different sets of significant bacteria.
While I understand that flipping the equation will not give you the exact same results (especially if you're using a completely different model), I would suspect that in many cases the ranked order of the coefficients would line up fairly well in the absence of large confounding effects.
That being said, if you are really interested in modeling disease ~ taxa, I would suggest checking out a LASSO method or perhaps something like Random Forest, as these methods can better handle a large number of predictors.
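As a rough illustration of the LASSO idea for disease ~ taxa, here is a minimal sketch in Python with NumPy. It is not MaAsLin2 functionality. The function name `l1_logistic`, the penalty value, and the simulated data are all assumptions made for illustration, and a real analysis would more likely use an established implementation (for example glmnet in R or scikit-learn in Python) with a cross-validated penalty. The sketch fits an L1-penalized logistic regression by proximal gradient descent on simulated "taxa" where only one taxon truly affects disease status.

```python
import numpy as np


def l1_logistic(X, y, lam, n_iter=2000):
    """L1-penalized logistic regression via proximal gradient descent.

    Minimizes mean log-loss + lam * sum(|beta|); the intercept is not penalized.
    Hypothetical helper for illustration only.
    """
    n, p = X.shape
    beta = np.zeros(p)
    intercept = 0.0
    # Step size 1/L, where L bounds the curvature of the mean log-loss.
    design = np.column_stack([np.ones(n), X])
    step = 4.0 * n / np.linalg.norm(design, 2) ** 2
    for _ in range(n_iter):
        prob = 1.0 / (1.0 + np.exp(-(intercept + X @ beta)))
        resid = prob - y
        # Gradient step on intercept and coefficients.
        intercept -= step * resid.mean()
        beta = beta - step * (X.T @ resid) / n
        # Soft-thresholding: the proximal operator of the L1 penalty.
        beta = np.sign(beta) * np.maximum(np.abs(beta) - step * lam, 0.0)
    return beta, intercept


# Simulated data (an assumption): 200 samples, 20 standardized "taxa",
# of which only taxon 0 influences the binary disease outcome.
rng = np.random.default_rng(0)
n, p = 200, 20
X = rng.normal(size=(n, p))
true_logit = 3.0 * X[:, 0]
y = (rng.random(n) < 1.0 / (1.0 + np.exp(-true_logit))).astype(float)

beta, intercept = l1_logistic(X, y, lam=0.1)
selected = np.flatnonzero(beta)  # indices of taxa with nonzero coefficients
print("selected taxa:", selected)
print("coefficient for taxon 0:", beta[0])
```

The L1 penalty shrinks most coefficients to exactly zero, which is what makes this kind of model workable when the number of taxa is large relative to the number of samples. Real abundance data would also need an appropriate normalization or transformation before being used as predictors.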