Hi!
I run in vitro simulations of colonic fermentation. In a recent study, we have a very specific cohort in two conditions: treated with a probiotic, or untreated. After shotgun sequencing, the probiotic strain has zero counts across all donors in the untreated condition and, as expected, counts > 0 in all the treated conditions.
After running MaAsLin 3 with normalization = TSS and transformation = LOG, that specific strain gives "Fitting error (NA p-value returned from fitting procedure)" in the abundance model. I assume this is because the two conditions are perfectly separated, so it cannot compute p-values. I imagine such cases are very rare in clinical settings, but with in vitro simulations and fewer donors they may come up quite often.
When using normalization = TSS, transformation = PLOG, and zero_threshold = -1, it does give a p-value and a coefficient for that strain, with no error. However, the manual says that PLOG is better suited to metabolomics, so I wonder what the implications are for a metagenomics analysis, in particular for all the other taxa that were non-zero in all conditions, when evaluating abundance models (not prevalence). Are there any risks?
My first thought is that rare taxa will be affected the most, since samples that initially had true zeros now have a pseudocount instead. Would I detect fewer rare taxa? I currently do not filter beforehand on prevalence/abundance.
Thanks a lot!
Hi,
When you say you’re getting an NA p-value, is that in the abundance or the prevalence results? The prevalence modeling has a system built in to avoid linear separability specifically for cases like this, so linear separability shouldn’t be generating NA p-values with the logistic regression models. The abundance models might give an NA p-value, since the abundance model is fit only on the portion of the data with non-zero abundance. In this case those samples would all belong to the same group (treated), so there is no variation in the independent variable.
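To make that last point concrete, here is a minimal Python sketch (not MaAsLin 3 code) of what happens when zeros are dropped before fitting. The sample values and the helper names `abundance_model_rows` and `group_effect_is_estimable` are made up for illustration; the only assumption is that the abundance model sees non-zero samples only, as described above.

```python
import math

# Hypothetical relative abundances of the probiotic strain (not real data).
samples = [
    ("control", 0.0), ("control", 0.0), ("control", 0.0),
    ("treated", 0.012), ("treated", 0.020), ("treated", 0.008),
]


def abundance_model_rows(rows):
    """Keep only non-zero samples and log2-transform them."""
    return [(group, math.log2(value)) for group, value in rows if value > 0]


def group_effect_is_estimable(rows):
    """A group coefficient needs at least two distinct groups in the data."""
    return len({group for group, _ in rows}) > 1


kept = abundance_model_rows(samples)
groups_left = sorted({group for group, _ in kept})
estimable = group_effect_is_estimable(kept)

print("groups left after dropping zeros:", groups_left)
print("group effect estimable:", estimable)
```

With these made-up values only the treated samples survive the filter, so there is nothing left to compare them against and no coefficient or p-value can be estimated for the group.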
If you’re seeing an NA p-value in the logistic regression results, would you mind sending me the MaAsLin 3 run log and/or a chunk of the data that reproduces the issue, since that shouldn’t be happening? Either here or willnickols@g.harvard.edu would work.
Will
This is indeed in the abundance model; I was not interested in the prevalence one for this analysis. I can imagine that statistically testing one group with only zeros against a second group with non-zero values is not very logical, so I assume there is no magic solution besides focusing on a prevalence model? The pseudocount (PLOG) would affect the rarer taxa, right?
Hmm… at least for the probiotic taxon, I would think prevalence is the relevant metric since it addresses the probability the taxon is present in the treatment group relative to the control group. (And if things worked perfectly, it would presumably always be there in the treatment group and never in the control group.) I’m not sure an abundance association would be helpful here: if the taxon is absent in the control group, the average log2 relative abundance in the control group is negative infinity, so your log fold change between the two groups is infinite. The PLOG gets around this by putting in a minimum count rather than a 0, but this effectively implies you think the taxon was present in all the control samples at half the abundance you could’ve detected.
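As a rough illustration of that last point, here is a small Python sketch (not MaAsLin 3 code) with made-up abundances. It assumes the pseudo-count is half the smallest non-zero value among these samples, which is how the behaviour is described above; the exact rule the package applies may differ, and `plog2` is a hypothetical helper.

```python
import math

# Hypothetical relative abundances of the probiotic strain (not real data).
control = [0.0, 0.0, 0.0]
treated = [0.012, 0.020, 0.008]


def plog2(values, pseudo):
    """log2 with zeros replaced by a pseudo-count (assumed PLOG-like behaviour)."""
    return [math.log2(v if v > 0 else pseudo) for v in values]


def mean(xs):
    return sum(xs) / len(xs)


def log_fold_change(pseudo):
    """Difference in mean log2 abundance, treated minus control."""
    return mean(plog2(treated, pseudo)) - mean(plog2(control, pseudo))


# Assumption: pseudo-count = half the smallest non-zero value observed.
min_nonzero = min(v for v in control + treated if v > 0)
pseudo = min_nonzero / 2

lfc_half_min = log_fold_change(pseudo)
lfc_smaller = log_fold_change(pseudo / 10)  # an equally arbitrary choice

print("pseudo-count:", pseudo)
print("log2 fold change with half-minimum pseudo-count:", lfc_half_min)
print("log2 fold change with a 10x smaller pseudo-count:", lfc_smaller)
```

Since every control value is a zero, the control mean is just log2 of the pseudo-count, so the estimated fold change is set by that choice rather than by anything measured in the control samples. Shrinking the pseudo-count tenfold adds log2(10) to the result.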
It all makes sense, Will. Thanks a lot for the support and reasoning!