Thank you so much for developing this great tool! I have a question about the CPLM model, which I recently used on clinical microbiome data. I received feedback asking me to validate the linear model assumption, and I wonder whether this is something that can be done in MaAsLin. Could I get some advice, please? I used MaAsLin2 v1.15.1.
Do you have any more information on what the reviewer meant by “validate the linear model assumption”? Without more information, it could mean a few things:
1. The mean of your feature abundance given the covariates is related to your (continuous) variable linearly: You can check for this by (a) plotting the residuals vs. fitted values to look for a trend (there should be none) or (b) plotting your (log transformed) feature abundance vs. covariate of interest and checking for a linear relationship.
2. The errors/residuals actually follow the CPLM distributional assumptions: This is pretty difficult to do properly, but you could just fit with different distributional assumptions (e.g., the recommended default: log normal) and see if your conclusions hold up.
3. Your observations are independent (or you’ve controlled for non-independence): Check that any natural grouping (e.g., per-subject) is controlled for with a random intercept.
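To make the residuals vs. fitted check concrete, here is a minimal sketch in plain Python. It is not MaAsLin code: it fits a one-covariate least-squares line to made-up log-abundance values, and the helper names (`residuals_vs_fitted`, `mean_residual_by_third`) are invented for illustration. A real model with several covariates or random effects would need the residuals from that fitted model instead.

```python
# Toy data, invented for illustration only: a continuous covariate x
# and a log-transformed feature abundance y.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
y = [0.9, 2.1, 2.9, 4.2, 4.8, 6.1, 7.0, 7.9]


def residuals_vs_fitted(x, y):
    """Fit y = intercept + slope * x by least squares.

    Returns (fitted, residuals) in the same order as the input.
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    fitted = [intercept + slope * xi for xi in x]
    residuals = [yi - fi for yi, fi in zip(y, fitted)]
    return fitted, residuals


def mean_residual_by_third(fitted, residuals):
    """Crude trend check: mean residual in the low, middle, and high
    third of the fitted values. Values far from zero, or a
    positive/negative/positive pattern, hint at non-linearity."""
    pairs = sorted(zip(fitted, residuals))
    n = len(pairs)
    cuts = [0, n // 3, (2 * n) // 3, n]
    means = []
    for start, stop in zip(cuts[:-1], cuts[1:]):
        chunk = [r for _, r in pairs[start:stop]]
        means.append(sum(chunk) / len(chunk))
    return means


fitted, residuals = residuals_vs_fitted(x, y)

# These pairs are what you would put on a residuals vs. fitted plot.
for f, r in sorted(zip(fitted, residuals)):
    print(f"fitted={f:7.3f}  residual={r:7.3f}")

print("mean residual by third:", mean_residual_by_third(fitted, residuals))
```

With a roughly linear relationship the three means should all sit near zero. If the relationship is curved, the outer thirds tend to share one sign and the middle third the other, which is the same pattern you would see as a bend in the plot.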
If I had to guess, (2) seems like the most likely interpretation since compound Poisson isn’t a very common distribution for microbiome analysis. If it isn’t too difficult to change, I’d suggest using the default TSS normalization, log transformation, and base linear model, but hopefully you’ll get similar results either way.
I actually fit 4 models, including log normal and CPLM, and I just went back to compare the results of those two models for one of my regressions. CPLM gave more results than log normal and also showed much more meaningful q-values: I filtered all_results.tsv by pval < 0.05 and looked at the qval column. The q-values from CPLM were all below 0.25, while those from the log normal model were above 0.5. There is some overlap between the two sets of results, but CPLM really allowed for more findings in disease vs. health.
Besides, as far as I can recall, the reviewer cared more about how MaAsLin adjusts p-values into q-values and only briefly mentioned checking the linear model assumption. I would think that a plot of residuals vs. fitted values could meet the need. But I’m open to any further feedback from the reviewer, and I can keep you posted.
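On the p-value to q-value question: as far as I know, MaAsLin2's default multiple-testing correction is Benjamini-Hochberg (the `correction = "BH"` setting), but that is worth confirming against the settings of the actual run. The sketch below is a plain-Python version of the BH adjustment, not MaAsLin's own code, and the p-values are made up.

```python
def benjamini_hochberg(pvals):
    """Return Benjamini-Hochberg adjusted p-values (q-values),
    in the same order as the input p-values."""
    m = len(pvals)
    # Indices of the p-values, sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: pvals[i])
    qvals = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so each q-value is never
    # larger than the q-value of the next-larger p-value.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvals[i] * m / rank)
        qvals[i] = running_min
    return qvals


# Made-up p-values for illustration only.
pvals = [0.001, 0.008, 0.039, 0.041, 0.60]
qvals = benjamini_hochberg(pvals)

for p, q in zip(pvals, qvals):
    print(f"pval={p:.3f}  qval={q:.5f}")
```

Each q-value depends on the total number of tests `m`, so filtering the results table by p-value after the fact does not change the q-values that were already computed over the full set of tests.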
Based on that information, I would caution that, as reported in the MaAsLin 2 paper, CPLM tends to produce more false positives than the log normal model, and this is likely amplified further if your model is much more complicated than treatment vs. controls. If you’re seeing more significant q-values with CPLM than log normal (especially if they’re all significant in CPLM vs. none in log normal), it would be worth plotting a few of the significant relationships to make sure they seem right. A p-value just tells you how unlikely your data were under the null, and if your null (e.g., under a compound Poisson rather than log normal) is very inconsistent with the data, you can still get a significant p-value, regardless of whether anything important is happening biologically.
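One way to decide which relationships to plot first is to list the associations that are significant under CPLM but not under log normal. This is a rough sketch only: the two tables are hypothetical stand-ins for each run's results file, the column names `feature`, `metadata`, `pval`, and `qval` are assumed, and the 0.25 cutoff is the one mentioned earlier in this thread.

```python
import csv
import io

# Hypothetical excerpts from two results tables (tab-separated).
# All values are invented for illustration.
cplm_tsv = (
    "feature\tmetadata\tpval\tqval\n"
    "taxon_A\tdisease\t0.001\t0.02\n"
    "taxon_B\tdisease\t0.010\t0.10\n"
    "taxon_C\tdisease\t0.030\t0.20\n"
)
lognormal_tsv = (
    "feature\tmetadata\tpval\tqval\n"
    "taxon_A\tdisease\t0.004\t0.12\n"
    "taxon_B\tdisease\t0.040\t0.55\n"
    "taxon_C\tdisease\t0.200\t0.80\n"
)


def significant_pairs(tsv_text, q_threshold=0.25):
    """Return the set of (feature, metadata) pairs with qval below
    the threshold."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return {
        (row["feature"], row["metadata"])
        for row in reader
        if float(row["qval"]) < q_threshold
    }


cplm_hits = significant_pairs(cplm_tsv)
lognormal_hits = significant_pairs(lognormal_tsv)

shared = cplm_hits & lognormal_hits
cplm_only = cplm_hits - lognormal_hits

print("significant in both:", sorted(shared))
print("significant in CPLM only:", sorted(cplm_only))
```

The CPLM-only associations are the ones most worth plotting against the raw data, since they are where the two distributional assumptions disagree.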