Hi,
First of all, thanks for the great tool. I’m using Maaslin2 to identify genes that are associated with clinical covariates from transcriptomics data.
I notice that the p-value adjustment (benjamini hochberg, but this would be the same for all adjustment methods that use the rank) is done not just on genes vs outcome, but also on genes vs all the other covariates (for me BMI, Sex, Age, and sequencing batch).
Benjamini-hochberg calculates significance by comparing p-values to a threshold value, equal to (rank of p-value/number of tests)*fdr. If the former is lower than the latter, it is significant.
Since there are many genes that are very strongly correlated with my covariates, these are much higher ranked than the genes vs my outcome. As a result, in the p-value adjustment my most highly ranked p-value against the outcome variable is at rank 510 (!); while ranks 1-509 are all genes vs sex/batch. Thus, this gene at rank 510 is tested against a threshold value of 3e-04, and it is therefore reported as significant. Meanwhile, if the adjustment would only be done with the outcome variable, the threshold value for this gene (now at rank 1) would be 4e-6, and it is no longer significant.
Long story short, by including the tests agains the other covariates into the p-value adjustment, the number of significant results is inflated (in my case from 0 with correction for only the outocme to 65 with vanilla maaslin2). Could it be an idea to mention this in the manual and/or include this into a future version of the package?