Reference level changes lead to different pairwise differential abundance results in Maaslin3

I have a question regarding the effect of changing the reference group in a categorical variable.

I am analyzing three groups: right, left, and rectum, using Maaslin3 with:

  • fixed effect of interest: tissue location (right, left, rectum)
  • covariates: sex, BMI, and age
  • model: abundance
  • significance threshold: qval_individual < 0.1

As I understand it, Maaslin3 uses the first factor level as the reference group.

When I set right as the reference level, Maaslin3 compares:

  • left vs right
  • rectum vs right

When I set rectum as the reference level, Maaslin3 compares:

  • right vs rectum
  • left vs rectum

However, the number of significant differential features detected for the right vs rectum comparison differs between the two runs, even though both runs should be testing the same pairwise contrast, just with the sign of the effect flipped.

For example:

  • using right as reference gives results for rectum vs right
  • using rectum as reference gives results for right vs rectum

but the significant feature counts are not identical.

Is this expected behavior in Maaslin3?

If so, could you explain why changing the reference level changes the number of detected features for what appears to be the same pairwise comparison?

I am wondering whether this could be related to:

  • model parameterization or contrast coding,
  • how qval_individual is calculated,
  • multiple testing correction being applied separately to different coefficient sets,
  • interaction with covariates (sex, BMI, age),
  • prevalence/filtering procedures,
  • or some other aspect of the Maaslin3 implementation.
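
To make the multiple-testing possibility concrete, here is a minimal sketch in plain Python. It is not MaAsLin3 code. It assumes, purely for illustration, that a Benjamini-Hochberg correction is pooled across both coefficients of the tissue variable; I do not know whether qval_individual actually works this way. All p-values are made up, and `bh_adjust` is a hypothetical helper name.

```python
def bh_adjust(pvals):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    n = len(pvals)
    order = sorted(range(n), key=lambda i: pvals[i])
    q = [0.0] * n
    running_min = 1.0
    # Walk from the largest p-value to the smallest, enforcing monotonicity.
    for rank in range(n, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvals[i] * n / rank)
        q[i] = running_min
    return q

# Made-up p-values for the comparison both runs share (right vs rectum).
shared = [0.004, 0.03, 0.04, 0.20]
# Made-up p-values for the other coefficient reported in each run.
other_run_a = [0.5, 0.6, 0.7, 0.8]          # left vs right (reference = right)
other_run_b = [0.001, 0.002, 0.003, 0.005]  # left vs rectum (reference = rectum)

# Pool each run's p-values, adjust, then keep only the shared comparison.
q_a = bh_adjust(shared + other_run_a)[:len(shared)]
q_b = bh_adjust(shared + other_run_b)[:len(shared)]

n_sig_a = sum(q < 0.1 for q in q_a)
n_sig_b = sum(q < 0.1 for q in q_b)
print(n_sig_a, n_sig_b)
```

In this toy case the shared comparison has 1 feature below 0.1 under the first pooling and 3 under the second, even though its p-values are identical in both. If the correction were pooled like this, the significant counts could differ between reference levels without any change in the underlying tests.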

Thank you very much for your help.

Hi,

A few questions:

  1. Are the differences you’re observing only in the significance tests, or in the coefficients as well?
  2. How different are the results in the two cases?

If you’d prefer to post a chunk of the two results here or email me at willnickols@g.harvard.edu, I can take a closer look. I would think the two models should give equivalent results, but there might be quirks in the median comparison producing differences.
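
As a sanity check on the "equivalent results" expectation, here is a minimal sketch using ordinary least squares in NumPy. It is not MaAsLin3 itself and does not include the median comparison; the data are simulated and the names (`fit`, `groups`, `age`) are hypothetical. It only shows that, in a plain linear model with treatment coding and a covariate, swapping the reference level negates the coefficient and leaves its standard error unchanged.

```python
import numpy as np

rng = np.random.default_rng(0)
levels_all = ["right", "left", "rectum"]
groups = np.repeat(levels_all, 20)
age = rng.normal(60, 10, size=groups.size)
# Simulated abundance with a rectum effect and a small age effect.
y = 0.5 * (groups == "rectum") + 0.02 * age + rng.normal(0, 1, size=groups.size)

def fit(reference):
    """OLS with treatment coding; returns {term: (coefficient, std_error)}."""
    levels = [g for g in levels_all if g != reference]
    columns = [np.ones(groups.size)]
    columns += [(groups == g).astype(float) for g in levels]
    columns.append(age)
    X = np.column_stack(columns)
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    dof = X.shape[0] - X.shape[1]
    sigma2 = resid @ resid / dof
    se = np.sqrt(np.diag(sigma2 * np.linalg.inv(X.T @ X)))
    names = ["intercept"] + levels + ["age"]
    return dict(zip(names, zip(beta, se)))

coef_rectum_vs_right, se_rectum_vs_right = fit("right")["rectum"]
coef_right_vs_rectum, se_right_vs_rectum = fit("rectum")["right"]

print(coef_rectum_vs_right, coef_right_vs_rectum)  # same size, opposite sign
print(se_rectum_vs_right, se_right_vs_rectum)      # equal up to rounding
```

Since the coefficient only changes sign and the standard error is the same, a test of the coefficient against zero gives the same p-value in both fits. If the p-values for this comparison differ between your two runs, the difference would have to come from a step beyond the basic linear model.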

Will