Log transformation pseudo-count

Does MaAsLin2 add a pseudo-count prior to log transformation? I am using relative proportions as inputs, so I set normalisation to “NONE”, but MaAsLin2 throws an error saying that NaNs are produced because zeros are encountered during the default log transform.

Hi @adityabandla!

Thanks for the question. Yes, the default log transformation in MaAsLin2 does add a pseudo-count. Following current best practice, the pseudo-count is half the smallest non-zero value of the feature. It does this with the following code:

LOG <- function(x) {
    # Replace zeros with half the smallest non-zero value of x
    y <- replace(x, x == 0, min(x[x > 0]) / 2)
    # Log-transform (base 2)
    return(log2(y))
}

The first thing I would check is whether any features have zero abundance in every sample, which would produce the error you are describing. Handling these case by case or applying an abundance and prevalence filter are both easy fixes for this type of issue. If that is not the case for your data, can you send a minimal reproducible example of data that replicates this error?
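A minimal base R sketch of that check and of a simple abundance and prevalence filter follows. It assumes a samples-by-features numeric matrix; the object name `abund`, the toy values, and both thresholds are hypothetical and only for illustration.

```r
# Hypothetical toy data: rows are samples, columns are features.
abund <- cbind(
  featA = c(0.2, 0.0, 0.5, 0.1),
  featB = c(0.0, 0.0, 0.0, 0.0),  # zero in every sample
  featC = c(0.0, 0.3, 0.0, 0.0)   # present in one sample only
)

# Features with no positive value leave min(x[x > 0]) with nothing to work on.
all_zero <- colSums(abund > 0) == 0
names(which(all_zero))

# Simple abundance and prevalence filter: keep features that exceed
# min_abundance in at least min_prevalence of the samples.
# Both thresholds are assumed values, not recommendations.
min_abundance  <- 0
min_prevalence <- 0.5
prevalence <- colMeans(abund > min_abundance)
filtered <- abund[, prevalence >= min_prevalence, drop = FALSE]
colnames(filtered)
```

With this toy table only `featB` is flagged as all-zero, and only `featA` survives the filter. As far as I know, `Maaslin2()` also accepts `min_abundance` and `min_prevalence` arguments, so the filter may not need to be applied by hand.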

Let me know if this helps!
Best,
Kelsey

Can someone clarify how the pseudocounts are calculated (per feature?).

I am running MaAsLin2 on HUMAnN 3 pathway/reaction abundances (already TSS-normalized as “Copies per Million”) with normalization set to “none” and the transformation method set to LOG.

I recently noticed that for features containing any zeros, the coefficients and p-values seem to get flattened somewhat. The issue is especially severe for features where almost all (or all) samples in one group are zero while another group has very high counts. This produces confusing results when comparing the actual normalized abundances with what MaAsLin2 reports as statistically significant: a feature can be off the charts in group1 and almost or exactly zero in group2, yet MaAsLin2 gives a relatively low coefficient and a high p-value.

I manually imputed zero values by taking the minimum non-zero value across all features/groups and dividing by 2, then re-ran MaAsLin2. This seemed to “fix” the problem: the coefficients and p-values better matched the actual differences in abundance for these samples.
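The manual imputation described above can be sketched in base R as follows. This is an illustration under assumptions, not the exact script used: the matrix name `cpm` and its values are hypothetical, with samples in rows and features in columns.

```r
# Hypothetical toy data: rows are samples, columns are features (e.g. CPM).
cpm <- cbind(
  pathA = c(1200, 950, 0, 0),
  pathB = c(4, 0, 6, 5)
)

# Global pseudo-count: half the smallest non-zero value in the whole table.
global_pseudo <- min(cpm[cpm > 0]) / 2

# Replace zeros only; non-zero values are left untouched.
cpm_imputed <- replace(cpm, cpm == 0, global_pseudo)
cpm_imputed
```

Because the imputed table no longer contains zeros, a later per-feature zero replacement such as the `LOG` function quoted earlier in this thread has nothing left to replace.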

Can you provide some more detail on how MaAsLin2 imputes values for zeros when using the LOG transformation? It seems to be calculating the minimum value per feature. Doing it per feature doesn’t seem to make sense, especially for sequencing data, where in theory there is no expectation that individual features differ inherently in how detectable they are. It also produces nonsensical-looking results, with features that differ hugely between groups coming out as not statistically significant, as I described above. Taking the minimum across all features and samples makes sense to me, but is there a reason to argue against doing this?

Hi @cbeekman,

Thanks for the feedback; using pseudo-counts appropriately can always be a tricky issue. In the original testing of MaAsLin2 we found that calculating the pseudo-count per feature performed fairly well and made sense, as differing bugs may have differing detection efficiencies, etc. We also found that in some cases a global pseudo-count ended up being arbitrarily far away from the non-zero values of some features after log transformation, which then resulted in the zero-inflated values masking any other signal for that feature.
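The trade-off between the two rules can be made concrete with a small base R sketch. The feature vectors and helper names below are hypothetical and chosen only so the arithmetic is easy to follow; the per-feature rule mirrors the `LOG` function quoted earlier in this thread.

```r
# Per-feature rule: half the smallest non-zero value of each feature.
log_per_feature <- function(x) log2(replace(x, x == 0, min(x[x > 0]) / 2))

# Global rule: one pseudo-count shared by all features.
log_global <- function(x, pseudo) log2(replace(x, x == 0, pseudo))

# Hypothetical toy features on very different scales.
high <- c(1024, 2048, 0, 4096)  # abundant feature with one zero
low  <- c(2, 0, 4, 8)           # rare feature with one zero
both <- c(high, low)
global_pseudo <- min(both[both > 0]) / 2

# Gap in log2 units between the smallest observed value and the imputed zero.
gap <- function(logged, x) min(logged[x > 0]) - min(logged[x == 0])

gap(log_per_feature(high), high)            # per-feature pseudo-count
gap(log_global(high, global_pseudo), high)  # global pseudo-count
```

For the abundant feature, the per-feature rule places the imputed zero 1 log2 unit below the smallest observed value, whereas the global rule places it 10 log2 units below. The first keeps zeros close to the observed range, which can shrink a real presence/absence difference between groups; the second stretches the zeros far from the non-zero values, which is the masking effect described above.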

That being said, I think the choice you have made in your analysis is sensible and well justified, and we will continue to assess our use of pseudo-counts as we make updates to MaAsLin.

Cheers,
Jacob Nearing