MaAsLin2 with non-clinical data and small samples

Hello,

I’m a student with a basic background in statistics and bioinformatics. Currently, I’m working on microbiome and metabolite data for my final project. I would like to ask advice on how to best set up MaAsLin for my analysis.

This is the experimental design:
We collected fecal samples from two donors and performed in vitro fermentation. Each samples received two different treatment. We measure microbiome composition, pathway abundance, and gene families at two time points (baseline and after treatment) and from two colon regions. There are no technical replicates, so we have a total of 16 samples.

I want to analyze the effect of treatment over time on the microbiome composition. We also want to perform correlation analysis between changes in microbial features (including pathway and gene families) and metabolite profiles later.

I would like to ask a very simple question: do you think using MaAsLin2 (or MaAsLin3) makes sense for this kind of paired, small-sample design? If so, what would you recommend using as fixed effects and random effects?

I’ve read some discussions on this forum but am still unsure if this is the right tool or if I should another tool instead, because this is what my supervisor suggest.

Thank you for the help.

Hi,

MaAsLin 3 can definitely help analyze the differences between donors, treatments, time points, and colon regions. 16 samples with 4 covariates might be too few samples to get statistically significant results, but you should at least be able to get a sense of effect sizes and community shifts. I’d recommend the model ~ donor + treatment + time_point + colon_region (all fixed effects) since all your covariates have only 2 levels and you don’t have any clustering variables besides donor. This same model would work for metabolites, gene families, and pathways.

If you want to correlate two high-dimensional datasets (e.g. taxa vs. metabolites), HAllA might be the way to go.

Will

Hi Will,

Thank you very much for your quick response and clear explanation.

I have one more question: I want to analyze whether the two treatments have different effects on microbial composition after 10 days of fermentation.

Do you think I should include an interaction term for time × treatment in the model? Or, given the small number of samples, would it be better to keep the model simple and avoid complicating the MaAsLin computation?

For context, my variables will be as followe:

  1. time0_treatment1 (reference)
  2. time10_treatment1
  3. time0_treatment2
  4. time10_treatment2

Thank you for your recommendation to use HAIIA, I will check it out.

If you’re interested in different effects of the treatments after 10 days, that’s an interaction question, so you would need the interaction term. Still, given the sample size, I’d doubt there would be significant effects.

True. As you suggest, I’ll be focused on the size effects and community shifts then.

Thank you for your advice. It’s really helpful!

1 Like

Hi Will,

I have a couple of follow-up questions and would really appreciate your input.

  1. Could you advise on how to set only the first variable as the reference level in MaAsLin3?

Currently, I’m doing the following to set reference levels:
metadata$time ← factor(metadata$time, levels = c(“0”, “10”))
metadata$treatment ← factor(metadata$treatment, levels = c(“treatment1”, “treatment2”))

However, when I include the formula time*treatment in the MaAsLin3 model, the results table only gives a value for “10_treatment2”, and not for “0_treatment1” or “10_treatment1”.
Should I create a new combined column for time + treatment in the metadata (as in MaAsLin2) to get full results? Or is there a more straightforward way to resolve it?

  1. Also, in your first answer, you specifically mentioned that MaAsLin3 is definitely suitable for my experimental design. But what do you think about MaAsLin2—would it also be appropriate? I’ve been using MaAsLin2 for a while and am still finding MaAsLin3 a bit challenging to set up. So I’m wondering if it’s okay to continue with MaAsLin2 for this analysis, given our study design.

Thank you very much for your time. I really appreciate your help.

Regarding the reference level, you could do the following:

metadata$time_treatment <- paste0(metadata$time, "_", metadata$treatment)

(you should get 4 levels out) and then set the reference level accordingly.

MaAsLin 2 could also be used for any of this.

Will

Hi Will,

Thank you very much for your input.
I’m now able to generate results for each variable.

I still have some follow-up questions:

  1. How can I test for individual main effects if I manually add the interaction terms as you suggested?
    I tried running both of these models:

Model 1 ~ time_treatment + time + treatment + colon + donor
Model 2 ~ time_treatment + colon + donor

But they yield the same results. I don’t see any output for the individual main effects of time or treatment, even when I include them in the full model.

  1. I tried to run both maaslin2 and maaslin3 with same set up of normalisation = “TSS” and transformation = “LOG” . However, I get a lot of warning messages in MaAsLin3 similar to this:

WARNING::Fitting problem for feature 334 returning NA

But when I run the maaslin2, I don’t get any warning messages at all. Is it normal?

  1. Lastly, may I know how does MaAsLin3 handle zero values when normalization = "NONE" and transformation = “LOG”?
    Should I add a pseudocount manually before running the analysis, or will MaAsLin3 automatically handle or skip zeros in log transformation?

Many thanks for your help.

  1. Because time + treatment are fully collinear with time_treatment (i.e., knowing time_treatment determines time and treatment), R will automatically drop the redundant columns when fitting the linear model. The “main effects” of time and treatment would be the time[baseline]_treatment[non-baseline] and time[non-baseline]_treatment[baseline].
  2. For MaAsLin 2, check that you’re actually getting no errors in the log since they’re not displayed in the results table. Assuming you’re actually getting no errors, it’s probably because of how the models function differently. When we split the abundance and prevalence in MaAsLin 3, we get more precise association types, but the abundance models will have fewer data points since they only use non-zeros (possibly resulting in too few data points for the model to fit), and the logistic prevalence models can have a harder time converging. With that said, it’s good to check the features that errored, but usually the errors are from there being very few samples with the feature present or too many terms in the model. In either of these cases, I wouldn’t trust a model that fit without errors too much anyways.
  3. MaAsLin 3 automatically splits the zeros from the non-zeros and uses the non-zeros in the abundance model and the zeros vs. non-zeros in the prevalence model.

Hi Will,

Thank you very much for your clear explanation.
it’s been really helpful for moving forward with my analysis.

I have another follow-up question regarding the interpretation of MaAsLin2 results.

In the end, I tested the effects of time and treatment independently. Here’s the script I used:

maaslin2_species_pc ← Maaslin2(
input_data = species_pc_matrix,
input_metadata = meta_pc,
output = “species_pc_main”,
min_prevalence = 0,
normalization = “NONE”,
transform = “LOG”,
fixed_effects = c(“time”, “treatment”),
reference = c(“time,0”, "treatment,diet1),
analysis_method = “LM”,
max_significance = 0.05,
standardize = FALSE,
correction = “BH”,
save_models = TRUE
)

I then subset the results by time and treatment to create separate volcano plots. However, I’m still unsure how to interpret the results accurately.

For example, here’s a snapshot of my results:

Does this mean that Klebsiella and Enterobacter were enriched at time point 10 compared to time point 0, regardless of treatment? Or does the result also take into account the diet 2 vs. diet 1 comparison specified in the MaAsLin2 setup—meaning that Klebsiella and Enterobacter were relatively enriched at time 10 compared to time 0 specifically within the diet 2 group?

I’m sorry, I’m having a bit of trouble phrasing the question clearly. Please let me know if anything is unclear.

Thank you very much!

Based on the model you’ve used with fixed effects for time and treatment but no interaction between the two, you’ve assumed that the change with time is the same for the two treatment groups and that the change between treated and non-treated is the same at the two time points. (It’s not “wrong” to make this assumption - I’m just making sure we’re clear on the model.) If you wanted to look for features that were enriched at a time specifically in one group, you would need an interaction term.