Hi Developer,

I have used `humann_renorm_table` to generate CPM-normalized data. I would like to apply a distance measure to address two aspects: (1) assess beta diversity between groups and (2) identify differential pathways.

From what I understand, CPM represents count-per-million reads normalized data.

Could you advise on which distance metric would be most appropriate for downstream analysis, particularly for calculating distances on the CPM table and for conducting differential testing?

Thank you!

We favor the Bray-Curtis distance for this sort of data (and for microbiome beta diversity questions in general). Note that it will be cleaner to do your distance calculations on the community totals and not the (combined) stratified data.
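For anyone who wants to see what the Bray-Curtis calculation amounts to on community totals, here is a minimal sketch in plain Python. The sample names and abundance values are hypothetical, and `bray_curtis` is a helper written for this illustration rather than a HUMAnN function. In practice a library routine such as `scipy.spatial.distance.braycurtis` would probably be the more convenient choice.

```python
def bray_curtis(u, v):
    """Bray-Curtis dissimilarity: sum|u_i - v_i| / sum(u_i + v_i)."""
    numerator = sum(abs(a - b) for a, b in zip(u, v))
    denominator = sum(a + b for a, b in zip(u, v))
    # Two all-zero samples are treated as identical (an assumption of this sketch).
    return numerator / denominator if denominator else 0.0

# Hypothetical community-total (unstratified) abundances for two samples.
sample1 = {"PWY1": 12.1, "PWY2": 25.2}
sample2 = {"PWY1": 20.0, "PWY2": 5.0}

# Align both samples on the same pathway order, filling absent pathways with 0.
pathways = sorted(set(sample1) | set(sample2))
u = [sample1.get(p, 0.0) for p in pathways]
v = [sample2.get(p, 0.0) for p in pathways]

distance = bray_curtis(u, v)
print(f"Bray-Curtis dissimilarity: {distance:.3f}")
```

The result ranges from 0 (identical profiles) to 1 (no shared features), which is part of why it is easy to interpret for beta diversity.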

Thank you, @franzosa. By “community totals,” are you referring to the combined pathways CPM table (the smaller matrix), rather than the larger matrix with individual components? For instance, pathway A would be the sum of A1, A2, A3, …, A10, and pathway B would be the sum of B1, B2, B3, …, B20, based on the HUMAnN3 output. So, when you mention community totals, do you mean the table containing pathway A and pathway B, rather than the table listing A1, A2, …, A10 and B1, B2, …, B20?

Also, which transformation would you recommend for differential pathway analysis?

Much appreciated!

I mean that when computing distances, you would want to work with the values like this:

```
PWY1 12.1
PWY2 25.2
```

And not the version with the stratifications included:

```
PWY1 12.1
PWY1|speciesA 10.1
PWY1|speciesB 2.0
PWY2 25.2
PWY2|speciesA 12.1
PWY2|speciesB 13.2
```

Or you could use just the stratified rows (i.e. the ones with `|` in them), but you don’t want to mix them since they represent separate compositions over the data.
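One way to separate the two kinds of rows is to test each feature name for the `|` delimiter. The sketch below uses the example values from this thread; the variable names are made up for illustration, and a real table would be read from a file rather than typed inline.

```python
# Example rows from this thread: (feature name, abundance).
rows = [
    ("PWY1", 12.1),
    ("PWY1|speciesA", 10.1),
    ("PWY1|speciesB", 2.0),
    ("PWY2", 25.2),
    ("PWY2|speciesA", 12.1),
    ("PWY2|speciesB", 13.2),
]

# Community totals: feature names without the "|" delimiter.
community = {name: value for name, value in rows if "|" not in name}

# Stratified rows: per-species contributions, kept as a separate table.
stratified = {name: value for name, value in rows if "|" in name}

print(community)
print(stratified)
```

Distances would then be computed on `community` or on `stratified`, but not on the two combined.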

We typically use a log transformation for microbiome features. If you want to include 0s in the modeling, then we replace them with half the smallest non-zero value (on a per-feature basis) before taking the log. FYI in MaAsLin v3 we are moving toward modeling abundance on only the non-zero values (with a log transform) and separate logistic modeling of the zeros (presence/absence).
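As a rough illustration of the zero replacement described above, the sketch below substitutes half the smallest non-zero value of a single feature before taking the log. The function name and the input values are hypothetical, and the natural log is an assumption since the base is not specified here. This is not the MaAsLin implementation.

```python
import math

def log_with_half_min(values):
    """Log-transform one feature, replacing zeros with half its smallest non-zero value."""
    nonzero = [x for x in values if x > 0]
    if not nonzero:
        raise ValueError("feature is zero in every sample; nothing to transform")
    pseudo = min(nonzero) / 2  # computed per feature, not per table
    # Natural log is an assumption of this sketch.
    return [math.log(x if x > 0 else pseudo) for x in values]

# Hypothetical abundances of one pathway across five samples.
feature = [0.0, 4.0, 2.0, 0.0, 8.0]
transformed = log_with_half_min(feature)
print(transformed)
```

Because the replacement value is derived per feature, a rare pathway and an abundant one each get a pseudo-value on their own scale.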

@franzosa Thank you so much!

I have another question. I noticed that the column sums for each sample in my CPM table are 1,000,000. When I divide my CPM values by 1,000,000, the resulting table closely resembles a relative abundance (TSS-normalized) table, with a column sum of 1 for each sample. However, the `humann_renorm_table --units cpm` tutorial states that the output isn’t TSS-normalized. Can I consider the CPM/1,000,000 table equivalent to a proportion table? Thank you!

Yes, we tend to avoid those units because they are so tiny, but they are equivalent. If you want your units to sum to 1.0, you can also use the renorm script in “relab” (relative abundance) mode. Note that both of these are TSS: they just use different total sums (1 vs. 1e6).
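To make the equivalence concrete, here is a small sketch of total-sum scaling with two different targets. The `tss` helper and the raw values are hypothetical and only illustrate the arithmetic, not how `humann_renorm_table` is implemented.

```python
def tss(values, total=1.0):
    """Total-sum scaling: rescale one sample so its values sum to `total`."""
    column_sum = sum(values.values())
    return {name: value * total / column_sum for name, value in values.items()}

# Hypothetical unnormalized abundances for one sample.
raw = {"PWY1": 5.0, "PWY2": 15.0}

cpm = tss(raw, total=1e6)    # sums to 1,000,000
relab = tss(raw, total=1.0)  # sums to 1

# Dividing CPM by 1e6 recovers the relative abundances.
cpm_as_proportion = {name: value / 1e6 for name, value in cpm.items()}
print(cpm, relab, cpm_as_proportion)
```

The two tables differ only by the constant factor 1e6. Bray-Curtis is unaffected by that choice when every sample is scaled to the same total, since the factor cancels in the ratio.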