Hello,
First of all, thank you very much for developing and maintaining such an extraordinary suite of tools.
I am currently analyzing lung metagenomic data with a total of 63 samples divided into 4 groups. Unfortunately, the design is highly unbalanced: one of the groups, which is of high clinical interest, contains only 3 samples.
I would like to ask whether MaAsLin 3 is an appropriate tool for this specific scenario, what precautions I should take, and how conservative I should be regarding the inclusion of covariates. So far, I have run the model using only sequencing depth (read depth) as a technical covariate. Interestingly, I obtained 4 significant associations exclusively for the group with n=3, while no significant associations were found for the larger groups.
Given the small sample size of that specific group, are these significant results statistically reliable, or are they likely artifacts driven by the low variance? Would you recommend a different statistical approach or alternative tool for this case?
Finally, I am also evaluating the functional profiling data obtained from HUMAnN3. Would your recommendations regarding sample size limits and covariate usage apply equally to the functional pathways analysis?
Thank you very much in advance for your time and support!
Hi,
I’ll assume your formula looked like ~ Group + Read_depth and the reference was one of the groups that had more than 3 samples.
Importantly for your case, ordinary linear modeling (which is what MaAsLin 3 does) assumes homoskedasticity, meaning the variance in abundance is assumed to be equal across the groups. Therefore, if the estimated variance (which comes primarily from the larger groups) is small relative to the difference between the groups, you’ll get a significant result even if one of the groups is quite small. The real question then is whether the variance is actually equal among the groups, and n=3 is almost certainly too small to assess this properly.
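To make the equal-variance point concrete, here is a minimal sketch in plain Python. It is a simplified two-group comparison with no covariates, not what MaAsLin 3 runs internally, and the values and the helper name `pooled_t` are made up for illustration. It shows how the pooled variance is weighted by each group's degrees of freedom, so the larger group largely determines it.

```python
import math
import statistics

# Made-up log-abundance values, for illustration only (not from any real study).
large_group = [0.1, -0.2, 0.0, 0.3, -0.1, 0.2, -0.3, 0.1, 0.0, -0.1,
               0.2, -0.2, 0.1, 0.0, -0.1, 0.3, -0.3, 0.1, 0.0, -0.1]  # n = 20
small_group = [1.0, 1.2, 0.8]  # n = 3


def pooled_t(a, b):
    """Two-sample t statistic under the equal-variance assumption (hypothetical helper)."""
    na, nb = len(a), len(b)
    va, vb = statistics.variance(a), statistics.variance(b)
    # Pooled variance: each group's sample variance weighted by its degrees of freedom.
    pooled_var = ((na - 1) * va + (nb - 1) * vb) / (na + nb - 2)
    se = math.sqrt(pooled_var * (1 / na + 1 / nb))
    t = (statistics.mean(b) - statistics.mean(a)) / se
    return t, pooled_var


t_stat, pooled_var = pooled_t(large_group, small_group)

# Fraction of the pooled-variance weight contributed by the large group.
weight_large = (len(large_group) - 1) / (len(large_group) + len(small_group) - 2)

print(f"pooled variance = {pooled_var:.4f}, t = {t_stat:.2f}")
print(f"weight of the large group in the pooled variance = {weight_large:.2f}")
```

With 20 versus 3 samples, the large group carries 19 of the 21 degrees of freedom, so the three samples contribute very little to the variance estimate used to test them.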
If you don’t want to assume the variance is equal among the groups you could do heteroskedasticity-robust regression (not currently an option in MaAsLin 3 but you could run it on your own), but with n=3 I’m pretty confident nothing will come out significant. I think there’s also a broader question of whether a reviewer/reader/etc. would find n=3 compelling even if statistically significant, and I doubt it.
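As a rough illustration of why dropping the equal-variance assumption costs so much power here, the sketch below computes a Welch-style (unequal-variance) t statistic and its Welch-Satterthwaite degrees of freedom for two groups. This is a simplified stand-in for heteroskedasticity-robust regression, not a full model with covariates. The data and the helper name `welch_t` are made up for illustration.

```python
import math
import statistics

# Made-up log-abundance values, for illustration only (not from any real study).
reference = [0.1, -0.2, 0.0, 0.3, -0.1, 0.2, -0.3, 0.1, 0.0, -0.1,
             0.2, -0.2, 0.1, 0.0, -0.1, 0.3, -0.3, 0.1, 0.0, -0.1]  # n = 20
small_group = [0.2, 1.0, 1.8]  # n = 3, noisier than the reference group


def welch_t(a, b):
    """Welch t statistic and Welch-Satterthwaite degrees of freedom (hypothetical helper)."""
    na, nb = len(a), len(b)
    # Each group's own variance of the mean; no pooling across groups.
    va = statistics.variance(a) / na
    vb = statistics.variance(b) / nb
    se = math.sqrt(va + vb)
    t = (statistics.mean(b) - statistics.mean(a)) / se
    df = (va + vb) ** 2 / (va ** 2 / (na - 1) + vb ** 2 / (nb - 1))
    return t, df


t_stat, df = welch_t(reference, small_group)
print(f"Welch t = {t_stat:.2f}, approximate degrees of freedom = {df:.2f}")
```

When the small group's variance dominates the standard error, the effective degrees of freedom fall toward n - 1 = 2 for that group, and the critical value for significance becomes very large. For the full model with covariates, the analogue would be a sandwich-type robust covariance estimator fitted outside MaAsLin 3.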
My recommendation would be to take the rest of the groups (ideally of size at least 10) and focus the analysis on those, maybe leaving the n=3 as an interesting clue for future work. This should work for both taxonomic and functional profiling.
Will