Should I first renorm gene_family to CPM and then regroup to pathway, or the other way around?

zhq · September 29, 2022, 7:11am

When analyzing HUMAnN 3 output, should I first renorm the gene families to CPM and then regroup to pathway, or first regroup to pathway and then renorm to CPM?
The HUMAnN 3 manual (https://github.com/biobakery/humann) suggests regrouping to other functional categories first and then renormalizing to CPM.

But the HUMAnN 3 tutorial (https://github.com/biobakery/humann) suggests renormalizing to CPM first and then regrouping to other functional categories.

franzosa · October 5, 2022, 1:07pm

These end up producing similar results. The pathway abundance that HUMAnN computes for you is based on the unnormalized gene families, such that both the gene families and pathways are in units of RPKs, and these can be normalized to CPMs or relative abundance to adjust for sequencing depth.

I have seen arguments that, philosophically, it may be better to normalize once for sequencing depth “as early as possible” in a pipeline (for HUMAnN, that would be the gene family level). This can produce some surprising results if a gene contributes to more than one broader function, however. For example, if I normalize my gene abundance to sum to 100%, and then I sum genes according to their Pfam domain membership, the total Pfam abundance would exceed 100% (because the average gene contains >1 Pfam domain). These Pfam values would still be safe to analyze - they have been adjusted for sequencing depth at least once - but the fact that their totals vary across samples can look a little strange.
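
To make the Pfam example concrete, here is a minimal Python sketch with toy numbers. The gene names, domain names, and the `renorm` and `regroup` helpers are all hypothetical stand-ins written for this illustration; they are not HUMAnN's own code, and the sketch ignores details such as ungrouped or unmapped features.

```python
# Toy RPK values for three hypothetical genes in one sample.
gene_rpk = {"geneA": 10.0, "geneB": 30.0, "geneC": 60.0}

# Hypothetical gene -> domain map; geneA and geneC each carry two domains.
gene_to_domains = {
    "geneA": ["PF1", "PF2"],
    "geneB": ["PF2"],
    "geneC": ["PF1", "PF3"],
}


def renorm(table):
    """Scale values so they sum to 1 (relative abundance)."""
    total = sum(table.values())
    return {name: value / total for name, value in table.items()}


def regroup(table, mapping):
    """Sum each gene's value into every group it maps to."""
    grouped = {}
    for gene, value in table.items():
        for group in mapping.get(gene, []):
            grouped[group] = grouped.get(group, 0.0) + value
    return grouped


norm_then_group = regroup(renorm(gene_rpk), gene_to_domains)
group_then_norm = renorm(regroup(gene_rpk, gene_to_domains))

# Normalizing first: the domain total exceeds 1 because two genes count twice.
print(round(sum(norm_then_group.values()), 6))
# Regrouping first: the domain total is 1 again.
print(round(sum(group_then_norm.values()), 6))

# Within one sample the two orders differ only by a constant factor.
ratios = {d: norm_then_group[d] / group_then_norm[d] for d in norm_then_group}
print({d: round(r, 6) for d, r in ratios.items()})
```

In this toy sample every domain differs between the two orders by the same factor (the regrouped total divided by the gene total), so relative patterns within a sample are unchanged. That factor depends on how many multi-domain genes each sample has, so it can vary from sample to sample.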

zhq · October 8, 2022, 1:50am

I have tried both methods and they produce different results, which puzzled me a lot.
Thanks for your reply! It’s really helpful.

FelipeMSD · November 13, 2023, 2:11pm

Hi @franzosa,

Do you know if any of the functional classifications that HUMAnN 3 uses to regroup UniRef90 protein families gives a one-to-one functional assignment? Given the compositional nature of metagenomic data, counting reads more than once after regrouping would really mess with differential abundance analyses once I transform the RPK data with CLR, right?

franzosa · November 29, 2023, 6:47pm

I think only UniRef50s are guaranteed to be an N:1 map for the UniRef90s (that would exactly maintain the composition). Everything else has the potential for 1:N mappings, even if they are quite rare (e.g. ECs). You can always renormalize your regrouped abundances again if you want to force them to form a composition (as a convenience), but this is different from the (critical) normalization we use for gene families to capture the composition induced by sequencing.
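
As a rough illustration of the difference between N:1 and 1:N maps, the Python sketch below uses toy values. The feature names (`u90_a`, `u50_x`, `ec1`, and so on), both mapping dictionaries, and the `to_cpm` and `regroup` helpers are hypothetical and written only for this example; they are not HUMAnN functions or real database entries.

```python
def to_cpm(table):
    """Scale values so they sum to one million (copies per million)."""
    total = sum(table.values())
    return {name: value / total * 1e6 for name, value in table.items()}


def regroup(table, mapping):
    """Sum each feature's value into every group it maps to."""
    grouped = {}
    for feature, value in table.items():
        for group in mapping.get(feature, []):
            grouped[group] = grouped.get(group, 0.0) + value
    return grouped


# Toy RPK values for three hypothetical UniRef90-like features.
gene_cpm = to_cpm({"u90_a": 5.0, "u90_b": 15.0, "u90_c": 30.0})

# N:1 map: every feature lands in exactly one group.
many_to_one = {"u90_a": ["u50_x"], "u90_b": ["u50_x"], "u90_c": ["u50_y"]}
# Map with one 1:N entry: u90_b is counted in two groups.
one_to_many = {"u90_a": ["ec1"], "u90_b": ["ec1", "ec2"], "u90_c": ["ec2"]}

n_to_one_total = sum(regroup(gene_cpm, many_to_one).values())
one_to_n_table = regroup(gene_cpm, one_to_many)
one_to_n_total = sum(one_to_n_table.values())

# Optional convenience step: force the regrouped table back to a composition.
renormalized_total = sum(to_cpm(one_to_n_table).values())

print(round(n_to_one_total))      # total is preserved by the N:1 map
print(round(one_to_n_total))      # total grows because u90_b counts twice
print(round(renormalized_total))  # total restored after renormalizing
```

With the N:1 map the regrouped values still sum to one million, so the composition is kept exactly. With the 1:N entry the total grows, and the second `to_cpm` call is the optional convenience renormalization mentioned above.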

FelipeMSD · December 5, 2023, 5:02pm

All right! Thank you very much!
