Analyzing Weekly Changes in Bacterial Profiles Following Repeated Administration of Formulations (MaAsLin 3)

Hello, and thank you for developing MaAsLin 3—it’s an incredibly valuable tool for our work!
I have three questions regarding its usage. We are analyzing microbiome data obtained via 16S amplicon sequencing. Our study includes two groups: 30 participants who received a placebo formulation and 28 participants who received the active formulation. Microbiome samples were collected at baseline (Week 0) and after four weeks (Week 4). To reflect this design, we prepared two categorical variables: Formula = c(Ctrl, Active) and Week = c(0w, 4w).

  1. In the feature table, several taxa could not be assigned and are labeled as “unidentified.” In some samples, these “unidentified” features account for roughly 30% of total reads. Should these unidentified features be retained when inputting data into MaAsLin 3, or is it more appropriate to remove them beforehand?

  2. Our goal is to evaluate whether the temporal changes in bacterial abundance observed in the Active group differ from those in the Control group. Would the following model specification be appropriate for this purpose?

    formula = "~ Formula*Week + Reads + (1|Participant_ID)"

    My concern is that the main effect of Formula under this specification may implicitly compare {(Ctrl, 0w) + (Ctrl, 4w)} with {(Active, 0w) + (Active, 4w)}, which may not reflect the specific contrast of interest. What model structure would be most appropriate for analyzing differential week-to-week changes between groups?

  3. We have a total of 58 participants, but approximately 100 potential covariates. I assume it is necessary to restrict the number of fixed effects to around six or fewer. If this is the case, would it be reasonable to always include key covariates such as Formula, Week, and Reads, and then run multiple models while rotating the remaining covariate candidates?

Please let me know if any part of my description requires clarification. Thank you very much for your time and assistance!

Best regards,
Sho

Hi Sho,

  1. I would keep the unassigned features in your samples, for two reasons. First, if the proportion of unassigned reads is significantly associated with your metadata of interest, that might be worth looking into. Second, if you remove them and then renormalize so that each sample sums to 1 (either manually or by running MaAsLin 3 with its default parameters), you could bias the coefficients for all the other (renormalized) features.
  2. Yes, I would use that model. With Ctrl and 0w as the reference levels, you will get a coefficient for Formula (the difference between the Active and Control groups at week 0), a coefficient for Week (the change in the Control group from week 0 to week 4), and a coefficient for Formula*Week (how much the effect of Week in the Active group differs from that in the Control group, i.e. the difference in differences). Presumably one of these is what you’re interested in?
  3. Indeed, I would not recommend a model larger than ~6 variables with that data size. Which variables you choose should be driven by the scientific question of interest. If you’re interested in the effect of your treatment over time, I’d just use the original model you wrote. Especially if you randomized treatment, I’d just leave it there. If you instead have confounding variables, I’d include those. However, I would not fit a base model and then swap out all of the 100 variables in your remaining slot. At a minimum, you’d need to FDR correct over all the resulting p-values after the fact, and it would become quite difficult to say what you’re actually investigating scientifically.
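
To make point 2 concrete, here is a minimal sketch in plain Python (it is not MaAsLin 3 code) of how the three coefficients relate to the four group-by-week means. It assumes treatment coding with Ctrl and 0w as the reference levels, so it is worth checking which levels your metadata actually uses as references. The cell means are invented for illustration, and the sketch ignores the Reads covariate and the per-participant random intercept.

```python
# Hypothetical mean log-abundance of one taxon in each Formula x Week cell.
# These numbers are invented purely to illustrate the arithmetic.
cell_means = {
    ("Ctrl", "0w"): 2.0,
    ("Ctrl", "4w"): 2.5,
    ("Active", "0w"): 2.25,
    ("Active", "4w"): 4.0,
}


def treatment_coded_coefficients(means):
    """Map four cell means to the coefficients of ~ Formula*Week.

    Assumes Ctrl and 0w are the reference levels (treatment coding).
    """
    intercept = means[("Ctrl", "0w")]
    # Formula: Active vs Ctrl at the reference week (0w) only.
    formula_active = means[("Active", "0w")] - means[("Ctrl", "0w")]
    # Week: 4w vs 0w within the reference group (Ctrl) only.
    week_4w = means[("Ctrl", "4w")] - means[("Ctrl", "0w")]
    # Interaction: change in Active minus change in Ctrl.
    change_in_active = means[("Active", "4w")] - means[("Active", "0w")]
    interaction = change_in_active - week_4w
    return {
        "Intercept": intercept,
        "Formula (Active)": formula_active,
        "Week (4w)": week_4w,
        "Formula*Week": interaction,
    }


coefs = treatment_coded_coefficients(cell_means)
for name, value in coefs.items():
    print(f"{name}: {value:+.2f}")
```

With these invented means, the Formula coefficient (+0.25) is the baseline difference only, not a comparison pooled over both weeks, and the interaction (+1.25) is the extra week-0-to-week-4 change in the Active group beyond the change seen in the Control group.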

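The renormalization concern in point 1 can also be sketched with a toy calculation in plain Python (again not MaAsLin 3 code). The two samples and their counts are hypothetical: TaxonA makes up the same share of total reads in both, but the samples differ in how many reads are unidentified.

```python
def relative_abundance(counts):
    """Scale counts so that the sample sums to 1."""
    total = sum(counts.values())
    return {taxon: n / total for taxon, n in counts.items()}


def drop_and_renormalize(counts, label="unidentified"):
    """Remove one feature, then rescale the remaining ones to sum to 1."""
    kept = {taxon: n for taxon, n in counts.items() if taxon != label}
    return relative_abundance(kept)


# Hypothetical read counts; TaxonA is 20% of total reads in both samples.
sample_low_unid = {"TaxonA": 200, "TaxonB": 700, "unidentified": 100}
sample_high_unid = {"TaxonA": 200, "TaxonB": 500, "unidentified": 300}

retained_low = relative_abundance(sample_low_unid)
retained_high = relative_abundance(sample_high_unid)
dropped_low = drop_and_renormalize(sample_low_unid)
dropped_high = drop_and_renormalize(sample_high_unid)

print("TaxonA, unidentified retained:",
      round(retained_low["TaxonA"], 3), round(retained_high["TaxonA"], 3))
print("TaxonA, unidentified dropped: ",
      round(dropped_low["TaxonA"], 3), round(dropped_high["TaxonA"], 3))
```

When the unidentified reads are retained, TaxonA has the same relative abundance in both samples. After dropping them and renormalizing, each sample is rescaled by a different factor, so TaxonA appears more abundant in the sample that had more unidentified reads. If the unidentified fraction tracks a metadata variable, that rescaling can leak into the coefficients of the other features.
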
Will

Hi Will,

Thank you very much for your prompt response.
I now have a clear understanding of all three points!

Sho