Choosing analysis method for maaslin2

Hello and thank you for this great analysis tool!
I am running MaAsLin 2 in R and would like to try methods other than LM.
What are the considerations for the proper normalization/transformation that goes with the various methods?
For instance, when I try to run the NEGBIN model, I receive an error message that the transformation is not appropriate.
Could you please refer me to an explanation of the various methods?

Thank you.
Best,
Lena

Hi @Lena_Lapidot - apologies that we have not documented this part of the functionality well in our current MaAsLin 2 tutorial. I hope the following is helpful when choosing the right combination of statistical model, normalization, and transformation.

  • For statistical models, if your input is count data, you can use NEGBIN and ZINB, whereas for non-count input you can use LM and CPLM.

  • Apart from the statistical models, you need to pay close attention to whether the selected normalization and transformation options are valid with respect to the input requirement above.

  • Among the normalization approaches implemented in MaAsLin 2, TMM and CSS only work on counts, and, unlike TSS and CLR, they also return normalized counts. Therefore, if your input is count data, you can use either of these two normalizations, or NONE if the data is already normalized, without a further transformation (i.e., transform = 'NONE').

  • Among the non-count models, CPLM requires the data to be positive. Therefore, any transformation that produces negative values will typically NOT work for CPLM.

  • All the non-LM models use an intrinsic log link because of their close connection to GLMs, so we recommend running them with transform = 'NONE'.

  • Apart from that, LM is the only model that works on both positive and negative values (following normalization/transformation) and you have more wiggle room to vary the corresponding parameters which are typically limited for non-LM models.
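As a rough illustration, the rules in the list above can be written down as a small checker. This is only a sketch in Python that paraphrases the bullets; it is not MaAsLin 2's own validation code, and the function name `check_combination` and its arguments are hypothetical. The CLR rule for CPLM is an inference from the "no negative values" bullet, since CLR output is centered around zero.

```python
# Sketch only: encodes the rules from the list above, not MaAsLin 2 internals.
COUNT_MODELS = {"NEGBIN", "ZINB"}
NON_COUNT_MODELS = {"LM", "CPLM"}
COUNT_NORMALIZATIONS = {"TMM", "CSS"}  # work only on counts and return counts


def check_combination(model, normalization, transform, input_is_count):
    """Return a list of problems; an empty list means no rule above is violated."""
    if model not in COUNT_MODELS | NON_COUNT_MODELS:
        return [f"unknown model {model!r}"]
    problems = []
    if normalization in COUNT_NORMALIZATIONS and not input_is_count:
        problems.append(f"{normalization} only works on counts")
    if model in COUNT_MODELS:
        if not input_is_count:
            problems.append(f"{model} needs count input")
        if normalization not in COUNT_NORMALIZATIONS | {"NONE"}:
            problems.append(f"{model} needs TMM, CSS, or NONE normalization")
    if model != "LM" and transform != "NONE":
        problems.append(f"{model} should be run with transform = 'NONE'")
    if model == "CPLM" and normalization == "CLR":
        # Assumption: CLR values are centered around zero, so some are negative.
        problems.append("CLR produces negative values, which CPLM cannot handle")
    return problems


print(check_combination("NEGBIN", "TMM", "NONE", input_is_count=True))
print(check_combination("NEGBIN", "TSS", "LOG", input_is_count=True))
```

The first call reports no problems, while the second flags both the normalization and the transformation, which is the kind of combination that triggers the error described in the original question.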

I know it’s a lot of information but I hope this helps. Please let us know if you have any follow-up questions or if you encounter any issues with the alternative non-default models.

All the best,
Himel

Thank you for the great explanation, Himel!
I have the raw abundances of 16S bacterial sequencing, obtained from fecal samples. I want to analyze them collapsed at the genus level. My fixed effects all have positive values (treated as continuous variables).

What is the best approach in this case? LM runs smoothly; however, when I look at the scatterplots, LM does not look like the best fit for some of the taxa.
On the other hand, when I try the NEGBIN model, I get errors regarding the proper normalization and transformation needed.

Thank you,
Best,
Lena

Hi @Lena_Lapidot - as described above, for NEGBIN you need to use one of the three normalization settings (TMM, CSS, or NONE) with no transformation, i.e., transform = 'NONE' together with normalization = 'CSS', normalization = 'TMM', or normalization = 'NONE' in your MaAsLin 2 call. Note that normalization = 'NONE' assumes that the data is already normalized.

Hi Himel,
I got it, it works!

Thank you :pray: :blush:

Hi @himel.mallick, can you please specify the difference between counts and non-counts?

As an example, relative abundances are not counts, and neither are log-transformed abundances. When the measurements are integer counts with no further processing that makes them continuous, they remain counts. Otherwise, they are no longer counts.
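A quick way to apply this distinction to a table is to test whether every value is a non-negative whole number. The snippet below is a minimal heuristic sketch in Python; the function name `looks_like_counts` is made up for illustration, and data that was processed and then rounded would still pass, so it cannot replace knowing how the table was produced.

```python
def looks_like_counts(values):
    """True if every value is a non-negative whole number (a heuristic only)."""
    return all(v >= 0 and float(v).is_integer() for v in values)


raw_reads = [0, 12, 340, 7]               # integer read counts
relative_abundance = [0.0, 0.03, 0.95]    # proportions after TSS-style scaling
log_abundance = [-4.6, -1.2, 0.3]         # log-transformed values

for name, values in [("raw", raw_reads),
                     ("relative", relative_abundance),
                     ("log", log_abundance)]:
    print(name, looks_like_counts(values))
```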

Many thanks,
Himel

Hi @himel.mallick,

Thanks for the great explanation of the methods. How would you go about assessing the performance of the model you’ve chosen for a particular data set? I.e. if you’ve got count data, you could use either NEGBIN or ZINB…but what would be the factors that determine which one you choose?

I have seen on other threads you’ve stated:

we usually don’t recommend one model over the others and leave it to the user’s best judgment

I’m keen to learn how to improve my judgement capabilities :blush:

Hi @Matt - this is a difficult question to answer, and various heuristic avenues exist with no universally right answer.

In the past, in my own analysis, I have run multiple models and chosen the final model based on a quick side-by-side comparison of various modeling options and the biological meaningfulness of the results.

One slightly more advanced way to eliminate methods would be to conduct a shuffled-data analysis (similar to the one in the original MaAsLin 2 paper; see Results) to get a sense of which methods tend to pick up false-positive signals in repeatedly permuted datasets, where no significant associations are expected.
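The idea can be sketched as a small harness, shown here in Python. Everything in it is an assumption for illustration: `run_model` is a hypothetical callable standing in for whatever fits one candidate method and returns one p-value (or q-value) per feature, and the shuffling scheme is the simplest possible one, not necessarily the one used in the paper.

```python
import random


def false_positive_rate(metadata, run_model, n_shuffles=100, alpha=0.05, seed=0):
    """Average fraction of features called significant on shuffled metadata.

    metadata  : list with one value per sample, in the same order as the samples
    run_model : hypothetical callable; takes a shuffled copy of the metadata and
                returns one p-value (or q-value) per feature
    """
    rng = random.Random(seed)
    fractions = []
    for _ in range(n_shuffles):
        shuffled = metadata[:]
        rng.shuffle(shuffled)  # breaks any real link between samples and metadata
        p_values = run_model(shuffled)
        hits = sum(p < alpha for p in p_values)
        fractions.append(hits / len(p_values))
    return sum(fractions) / len(fractions)


# Toy stand-in for a model: always returns the same four p-values.
def toy_model(shuffled_metadata):
    return [0.01, 0.5, 0.9, 0.2]


print(false_positive_rate([1, 2, 3, 4], toy_model, n_shuffles=10))
```

Running the same harness once per candidate method and comparing the resulting rates gives a rough sense of which method is most prone to spurious hits on a given dataset.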

If you are still left with multiple choices, one crude way would be to somehow combine the results to look for consistent signals. A very recent paper did something similar to bypass the task of choosing an optimal normalization method and instead combined the results from various methods to come to a consensus, which has also been recommended in another recent paper.
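One crude way to combine results is a simple vote: keep the features that several methods agree on. The Python sketch below assumes you already have, for each method, the list of features it called significant; the function name `consensus_features`, the method labels, and the genus names are hypothetical, and this is not the procedure used in the papers mentioned above.

```python
from collections import Counter


def consensus_features(significant_by_method, min_methods=None):
    """Features called significant by at least `min_methods` methods (default: all)."""
    if min_methods is None:
        min_methods = len(significant_by_method)
    votes = Counter()
    for features in significant_by_method.values():
        votes.update(set(features))  # each method votes at most once per feature
    return sorted(f for f, n in votes.items() if n >= min_methods)


# Hypothetical significant genera from three separate runs.
results = {
    "LM": ["Bacteroides", "Prevotella", "Roseburia"],
    "NEGBIN": ["Prevotella", "Roseburia"],
    "ZINB": ["Roseburia", "Blautia"],
}

print(consensus_features(results))                 # agreed on by all three
print(consensus_features(results, min_methods=2))  # agreed on by at least two
```

Requiring agreement from every method is conservative; lowering `min_methods` trades some of that caution for sensitivity.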

Again, this is an open area of research, and I am sure there will be more elegant solutions in the future but in general, a combination of the above should give you a good start.

I hope this helps in somewhat improving your judgment capabilities next time you analyze your data.

Thanks for asking this well-thought-out question,
Himel
