I’m looking to use MaAsLin2 to identify OTUs that are significantly associated with disease state (Y/N). I have multiple samples per healthy individual, but only one sample from each diseased individual. Therefore, I’d like to control for the pairs of samples present in the healthy/non-diseased group.
My initial thought was to use MaAsLin2 to run a model with disease state (Y/N) as the fixed effect and use subject ID (to control for multiple samples per individual in the healthy/non-diseased group only) as the random effect.
I don’t get any errors, but I’m concerned about the output. Out of the 412 OTUs, there are approximately 60 or so that are deemed to be significant. All of these OTUs are associated with the diseased group (disease=Y). I find this to be a bit odd, especially considering that this was not the case when I identified these OTUs using differential abundance analysis (DESeq2). Further, no heatmap was generated. I just wanted to be sure that my methods were appropriate before I consider including this in a manuscript.
I am pretty sure that my input data is correct. I used a conventional OTU table with rows as taxa. The metadata has rows as individual samples and the variables of interest are factors. Am I possibly doing something wrong? I am concerned with the sparseness of my OTU table (the samples are low biomass). I assume a zero-inflated model would be more appropriate, but then I could not incorporate fixed/random effects.
Thank you in advance for your time and hard work on this package!
Hi @Emily_Klann - I think your model is correct. If you are concerned about zero-inflation, you can try out one of CPLM, ZICP, or ZINB as your analysis method (we recently included the full functionality of these non-LM models in the most upgraded version submitted to Bioconductor and available in GitHub: https://github.com/biobakery/Maaslin2). Another alternative is to try out strict filtering to remove some of these highly zero-inflated features before running a linear model. The fact that your DESeq2 results are different from a default MaAsLin2 run with random effects is not surprising given that DESeq2 is not capable of fitting random effects and is not appropriate for repeated measures.
@Emily_Klann A quick follow-up on your comment on ‘
All of these OTUs are associated with the diseased group (disease=Y)’. Since your metadata variable (diseased group) is binary, MaAsLin2 internally models this as a binary variable with disease=Y as the reference level. Therefore, this association should be interpreted as the association with the entire diseased group, not with just one factor level. The value column in your output file (disease=Y) indicates that the coefficients (and their signs) should be interpreted with disease=Y as the reference e.g. OTU X is more abundant in disease=N as compared to disease=Y (if the corresponding coefficient is positive) and vice versa. I hope it makes sense.
Thank you! I feel a bit silly for not realizing this. I still have two questions:
- Is providing either relative abundance or raw counts (with normalization) more appropriate?
- Do you happen to know why the heatmaps are not being generated?
Regarding 2, the heatmap will only be generated if there are more than one metadata. There is no right and wrong answer to 1. Among the available options, LM, CPLM, and ZICP can be run on both relative abundances and counts, whereas, negbin and ZINB can only be run on counts. Both approaches have pros and cons, and we usually leave it up to the user.