Filter-normalize order and comments on tutorial

Dear developers, thank you very much for developing such a great and versatile package. Our group works on respiratory microbiota data and I frequently advise colleagues to use MaAsLin2 over other packages.

However, I recently came across two related issues that I think should either be changed or be flagged more explicitly in the tutorial.

  1. Filtering occurs before normalization. For raw/count input data combined with, for example, TSS-normalization, this seems odd, as the relative abundances of features/taxa would differ considerably depending on the min_abundance, min_prevalence and min_variance settings. This also means that p-values shift depending on which other features are fed to the function (and on the amount of filtering).
    In our lab, we generally convert to relative abundance first and then filter (so that a given feature, say ‘Streptococcus_1’, is present at 20% in a particular sample and not at 17%/15%/26% depending on the filtering parameters).
  2. The tutorial explains well how to select a correct normalization method. Yet the example data (HMP2_taxonomy.tsv/HMP2_metadata.tsv) are already TSS-normalized, while the tutorial uses the default normalization (which is TSS). This implies that the data are TSS-normalized twice (which is inappropriate in combination with the min_prevalence filtering step, which is also applied by default). I think this should be changed. In addition, the developers could consider changing the default normalization from TSS to NONE, advising users to use prenormalized data in principle or to consider the implications of not normalising themselves.
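
To make point 1 concrete, here is a minimal sketch in plain Python. It is not MaAsLin2 code: the taxon names, counts, and the set of features removed by the filter are all made up for illustration, and `tss` and `drop` are hypothetical helpers.

```python
# Toy counts for a single sample (illustrative values only).
counts = {"Streptococcus_1": 20, "Moraxella_1": 50, "Rare_1": 15, "Rare_2": 15}


def tss(sample):
    """Total-sum scaling: divide each value by the sample total."""
    total = sum(sample.values())
    return {taxon: value / total for taxon, value in sample.items()}


def drop(sample, taxa):
    """Remove the given taxa, standing in for a prevalence/abundance filter."""
    return {taxon: value for taxon, value in sample.items() if taxon not in taxa}


# Assume the filter removes these two features.
filtered_out = {"Rare_1", "Rare_2"}

normalize_then_filter = drop(tss(counts), filtered_out)
filter_then_normalize = tss(drop(counts, filtered_out))

print(normalize_then_filter["Streptococcus_1"])            # 0.2
print(round(filter_then_normalize["Streptococcus_1"], 3))  # 0.286
```

With normalization first, ‘Streptococcus_1’ stays at 20% regardless of what is filtered; with filtering first, its value depends on how many counts the filter removed from the denominator.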

Curious to hear your views on this and happy to discuss further if needed.

Hi @wsteenhu,

Thanks for the kind words and recommendation.

Regarding your points about the tool's tutorial and operations: you are absolutely correct on both. We are currently working on fixing both of these issues.
For 1: we are changing the order of normalization and filtering. We are also considering adding more warnings/errors within MaAsLin to alert users to common issues here. Internally, we mostly feed MaAsLin data that are already filtered and normalized, which you noted is common practice, but we do want MaAsLin itself to follow the correct procedure.

For 2: again, this is correct. Since we are working with MetaPhlAn output, the data are already normalized, so we should turn TSS off in the tutorial. We will make this fix and make this portion of the tutorial more verbose.
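
As a rough illustration of why the second TSS pass matters (plain Python, not MaAsLin internals; the profile values and the `tss` helper are made up for this sketch): re-applying TSS to intact relative abundances changes nothing, but once a filter has dropped a feature, the second pass rescales everything that remains.

```python
def tss(sample):
    """Total-sum scaling: divide each value by the sample total."""
    total = sum(sample.values())
    return {taxon: value / total for taxon, value in sample.items()}


# Already-normalized profile (relative abundances summing to 1; toy values).
profile = {"Taxon_A": 0.5, "Taxon_B": 0.25, "Taxon_C": 0.25}

# TSS on the intact, pre-normalized profile leaves it unchanged.
unchanged = tss(profile)

# Assume a filter removes Taxon_C before normalization runs.
kept = {taxon: value for taxon, value in profile.items() if taxon != "Taxon_C"}
rescaled = tss(kept)

print(unchanged == profile)           # True
print(round(rescaled["Taxon_A"], 3))  # 0.667
```

Taxon_A moves from 0.5 to roughly 0.667 purely because of the filter, which is the effect that setting normalization to NONE for pre-normalized input avoids.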

Thanks again for the kind words and the suggestions.