The bioBakery help forum

Compatibility of Humann2 Output Files for Maaslin2

Hi,

I’d like to clarify which of Humann2’s output files are compatible with Maaslin2, and what normalization/transformations are needed. I ran multiple samples through Humann2, merged the results across samples, and normalized the merged tables by both cpm and relative abundance. I then separately tried Humann2’s pathway abundance output and pathway coverage output as inputs for Maaslin2. I did not get any significant results with the Humann2 pathway abundance dataset under the following Maaslin2 parameters: normalization = “NONE”, analysis_method = “CPLM”, and all other parameters left at their defaults. However, I got significant hits with the Humann2 pathway coverage data using the same set of Maaslin2 parameters.

Related to the Maaslin2 inputs and parameters I tried above, can I clarify:

  1. Should I be using Humann2’s pathway coverage or pathway abundance data as input for Maaslin2, and should the data be normalized by cpm or relative abundance?
  2. After normalizing Humann2 pathway abundance and coverage data by cpm/relative abundance, is normalization still needed for Humann2 pathway abundance/coverage datasets when used as input for Maaslin2?
  3. How do I determine whether to use LM or CPLM for analysis? I selected CPLM because my datasets look like there are multiple values close to zero.

Thank you for your help!
Germaine
Graduate Student, UCSF

Dear Germaine,

I strongly encourage you to check these materials, which show how to process Humann2 pathway information with R.

Best,

Florentin

Hi Florentin,

Thank you for the very helpful resources! I’d like to follow up on some of my previous questions in light of the examples shared in these resources. Please refer to the bullet points below:

1. Should I be using Humann2’s pathway coverage or pathway abundance data as input for Maaslin2, and should the data be normalized by cpm or relative abundance?

  • I saw that you used Humann2’s pathway abundance output normalized by relative abundance as input for Maaslin2. Is Humann2’s pathway coverage output also a suitable input for Maaslin2?
  • Does it matter whether the data fed into Maaslin2 is normalized by cpm or by relative abundance/proportion? From your example code, it looks like Maaslin2 only accepts data normalized to relative abundance per sample.

2. Thanks for clarifying that no further normalization is needed when feeding Humann2 output into Maaslin2 if the tables have already been normalized by cpm/relative abundance.

3. How do I determine whether to use LM or CPLM for analysis? I selected CPLM because my datasets look like there are multiple values close to zero.

  • I noticed that the MGX dataset you used in your example similarly has multiple values that are zero or close to zero, and with this dataset you used the default LM analysis option within Maaslin2. When should the analysis method be changed to CPLM instead?

Thank you for your help!

Germaine

Hi @germaine260 - I wanted to quickly chime in to answer some of your model-specific questions. My answers do not address the appropriateness of any specific option, which usually depends on the analysis question; in most cases, there is no “gold standard”.

Regarding whether to use ‘LM’, ‘CPLM’, or other models, intuitively, CPLM or a zero-inflated alternative should perform better in the presence of zeroes but based on our simulation benchmarking, we do not have evidence that CPLM is significantly better than LM. Having said that, there could be circumstances where a non-LM statistical model is more appropriate. I suggest you run multiple models and decide yourself based on the results you get as there is no single best model for all scenarios. I hope it makes sense!

Hi @germaine260,

I just forked the repo from the biobakery group to update the links to the data, so all credit for the script goes to them.

Is Humann2’s pathway coverage output also a suitable input for Maaslin2?

I am not sure; I started with pathway abundance and gene families.

Does it matter whether the data fed into Maaslin2 is normalized by cpm or by relative abundance/proportion? From your example code, it looks like Maaslin2 only accepts data normalized to relative abundance per sample.

In the pathway-level example, the unstratified table is normalized to sum to 1 after removing the unmapped and unintegrated pathways, and before the beta/alpha diversity analyses as well as Maaslin2. That makes sense to me.

Best

Thanks Florentin and Himel for your useful advice!

@germaine260 If you’ve normalized HUMAnN pathways with the renorm_table script then you don’t need to perform any additional normalization – the effects of sequencing depth will have already been removed. I tend to work with CPM units because I find them to be more convenient, but they are numerically equivalent to relative abundances (RA) for modeling purposes (CPM = RA * 1e6).
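A minimal sketch of why the two units behave the same in a model, assuming a log2 transform and a plain least-squares fit. The pathway values and the covariate are made up for illustration, and `to_cpm` and `ols_slope` are hypothetical helpers, not part of HUMAnN or Maaslin2:

```python
import math

def to_cpm(rel_abund):
    """Convert relative abundances (fractions of 1) to copies per million."""
    return [x * 1e6 for x in rel_abund]

def ols_slope(x, y):
    """Slope of a simple least-squares fit of y on x."""
    mx = sum(x) / len(x)
    my = sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx

# Hypothetical data: one pathway's relative abundance in six samples,
# plus a made-up continuous covariate for the same samples.
ra = [0.010, 0.020, 0.015, 0.040, 0.030, 0.050]
covariate = [20, 30, 25, 50, 40, 60]
cpm = to_cpm(ra)

# log2(CPM) = log2(RA) + log2(1e6), so only the intercept shifts;
# the fitted slope for the covariate is unchanged.
slope_ra = ols_slope(covariate, [math.log2(v) for v in ra])
slope_cpm = ols_slope(covariate, [math.log2(v) for v in cpm])
print(slope_ra, slope_cpm)
```

Without a log transform the coefficient would simply be rescaled by 1e6, which changes the reported effect size but not the fit itself.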

The pathway coverage values do not need to be normalized – they are defined in [0,1] as a measure of confidence in the presence of a pathway. They are provided mainly as a diagnostic; I can’t think of a case where we performed statistical tests against them.

Thanks for clarifying Eric!

Hi Eric,

If I am not wrong, normalizing HUMAnN pathways with renorm_table will keep the unmapped and unintegrated pathways, right?
If you would like to investigate changes in proportion or beta-diversity, might it be more informative to remove the unmapped and unintegrated pathways prior to normalisation?
Does that make sense?

That is actually what is done here: https://github.com/biobakery/omnibus-and-maaslin2-rscripts-and-hmp2-data

Best,

Florentin

@fconstancias Correct - the default is to keep “special” abundances (like UNMAPPED) in the table. There is also an option to remove them. Whether or not you keep them depends on whether or not you think changes in those quantities are biologically interesting (keeping/removing are both justifiable). I typically remove them from the pathways tables when doing abundance-weighted diversity analyses so that they don’t dominate the calculation. However, they can be useful to retain for linear modeling as they limit the potential for housekeeping functions to artificially inflate in less-well-characterized communities.
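As a rough sketch of the "remove, then renormalize" route for a diversity analysis, assuming an unstratified table held as one dictionary per sample. The pathway names and values are invented, and `drop_special_and_renormalize` is a hypothetical helper rather than anything provided by HUMAnN:

```python
# Special rows discussed in this thread; assumed to appear under these exact names.
SPECIAL = ("UNMAPPED", "UNINTEGRATED")

def drop_special_and_renormalize(abundances, scale=1.0):
    """Remove special rows, then rescale the remaining values to sum to `scale`.

    Assumes an unstratified table (one row per pathway, no per-species rows).
    """
    kept = {name: value for name, value in abundances.items()
            if name not in SPECIAL}
    total = sum(kept.values())
    if total == 0:
        # Nothing left to normalize; return zeros rather than dividing by zero.
        return {name: 0.0 for name in kept}
    return {name: scale * value / total for name, value in kept.items()}

# Hypothetical sample in which the special rows hold most of the signal.
sample = {
    "UNMAPPED": 200.0,
    "UNINTEGRATED": 500.0,
    "PWY_A": 100.0,
    "PWY_B": 200.0,
}

proportions = drop_special_and_renormalize(sample)
print(proportions)
```

Passing `scale=1e6` would give CPM-style units instead of proportions. For linear modeling, where retaining the special rows can be useful, the same table would be normalized without this filtering step.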

Makes sense. Thanks for your input @franzosa