When analyzing HUMAnN 3 output, should I first renormalize the gene families to CPM and then regroup to pathways, or first regroup to pathways and then renormalize to CPM?
The HUMAnN 3 manual (https://github.com/biobakery/humann) suggests regrouping to another functional category first and then renormalizing to CPM.
But the HUMAnN 3 tutorial (https://github.com/biobakery/humann) suggests renormalizing to CPM first and then regrouping to another functional category.
These end up producing similar results. The pathway abundance that HUMAnN computes for you is based on the unnormalized gene families, such that both the gene families and pathways are in units of RPKs, and these can be normalized to CPMs or relative abundance to adjust for sequencing depth.
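To make the RPK to CPM step concrete, here is a minimal sketch of a sum-to-one-million normalization on a single toy sample. The feature names and values are made up for illustration, and the sketch ignores the special rows and stratified rows that a real HUMAnN table contains, so it is not a stand-in for HUMAnN's own renormalization script.

```python
# Hypothetical RPK values for one sample (not real HUMAnN output).
gene_rpk = {"UniRef90_A": 12.0, "UniRef90_B": 3.0, "UniRef90_C": 5.0}


def to_cpm(rpk):
    """Scale abundances so that they sum to one million (copies per million)."""
    total = sum(rpk.values())
    return {feature: value / total * 1e6 for feature, value in rpk.items()}


gene_cpm = to_cpm(gene_rpk)
for feature, value in gene_cpm.items():
    print(feature, round(value, 1))
```

Because every feature in a sample is divided by the same total, the ratios between features are unchanged; only the dependence on sequencing depth is removed.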
I have seen arguments that, philosophically, it may be better to normalize once for sequencing depth “as early as possible” in a pipeline (for HUMAnN, that would be the gene family level). This can produce some surprising results if a gene contributes to more than one broader function, however. For example, if I normalize my gene abundance to sum to 100%, and then I sum genes according to their Pfam domain membership, the total Pfam abundance would exceed 100% (because the average gene contains >1 Pfam domain). These Pfam values would still be safe to analyze - they have been adjusted for sequencing depth at least once - but the fact that their totals vary across samples can look a little strange.
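The Pfam example can be worked through with a small toy calculation. In the sketch below, the gene names, abundances, and gene-to-Pfam map are all hypothetical; the only point is that summing normalized genes into groups with a 1:N map gives a total above 100%.

```python
# Hypothetical gene abundances (RPK) for one sample.
gene_rpk = {"geneA": 30.0, "geneB": 50.0, "geneC": 20.0}

# Hypothetical gene -> Pfam map; geneA and geneC each carry two domains.
gene_to_pfam = {
    "geneA": ["PF00001", "PF00002"],
    "geneB": ["PF00002"],
    "geneC": ["PF00001", "PF00003"],
}


def normalize(abund):
    """Scale abundances so that they sum to 1 (i.e. 100%)."""
    total = sum(abund.values())
    return {key: value / total for key, value in abund.items()}


def regroup(abund, mapping):
    """Sum each gene's abundance into every group it maps to."""
    grouped = {}
    for gene, value in abund.items():
        for group in mapping.get(gene, []):
            grouped[group] = grouped.get(group, 0.0) + value
    return grouped


gene_relab = normalize(gene_rpk)                 # sums to 1.0
pfam_relab = regroup(gene_relab, gene_to_pfam)   # geneA and geneC counted twice
pfam_total = sum(pfam_relab.values())
print(f"Pfam total: {pfam_total:.0%}")
```

Here the Pfam total comes to 150% because two of the three genes are counted twice. A sample with fewer multi-domain genes would give a smaller total, which is why the totals vary across samples.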
Do you know if any of the functional classifications that HUMAnN 3 uses to regroup UniRef90 protein families give a one-to-one functional assignment? Given the compositional nature of metagenomic data, counting reads more than once after regrouping would distort differential abundance analyses once I transform the RPK data with CLR, right?
I think only UniRef50s are guaranteed to be an N:1 map for the UniRef90s (that would exactly maintain the composition). Everything else has the potential for 1:N mappings, even if they are quite rare (e.g. ECs). You can always renormalize your regrouped abundances again if you want to force them to form a composition (as a convenience), but this is different from the (critical) normalization we use for gene families to capture the composition induced by sequencing.
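As a rough illustration of the difference between N:1 and 1:N maps, the sketch below regroups the same toy CPM values with two hypothetical maps and then renormalizes the second result. All identifiers, values, and mappings are invented for the example, and every feature is assumed to have a mapping, which will not hold for real data.

```python
# Hypothetical UniRef90 abundances, already normalized to CPM.
uniref90_cpm = {"U90_a": 400000.0, "U90_b": 350000.0, "U90_c": 250000.0}

# N:1 map: every UniRef90 belongs to exactly one UniRef50 (hypothetical IDs).
to_uniref50 = {"U90_a": ["U50_x"], "U90_b": ["U50_x"], "U90_c": ["U50_y"]}

# 1:N map: U90_b is annotated with two EC numbers (hypothetical assignment).
to_ec = {
    "U90_a": ["1.1.1.1"],
    "U90_b": ["1.1.1.1", "2.7.1.1"],
    "U90_c": ["2.7.1.1"],
}


def regroup(abund, mapping):
    """Sum each feature's abundance into every group it maps to."""
    grouped = {}
    for feature, value in abund.items():
        for group in mapping.get(feature, []):
            grouped[group] = grouped.get(group, 0.0) + value
    return grouped


def renormalize(abund, target=1e6):
    """Rescale abundances so that they sum to `target`."""
    total = sum(abund.values())
    return {key: value / total * target for key, value in abund.items()}


u50_cpm = regroup(uniref90_cpm, to_uniref50)
ec_sum = regroup(uniref90_cpm, to_ec)
ec_cpm = renormalize(ec_sum)

print("UniRef50 total:", sum(u50_cpm.values()))           # total preserved
print("EC total before renorm:", sum(ec_sum.values()))     # inflated by double counting
print("EC total after renorm:", round(sum(ec_cpm.values())))
```

The N:1 regrouping keeps the original total, while the 1:N regrouping inflates it until the second renormalization forces it back to one million. That second step changes only the scale within each sample; it does not undo the double counting of U90_b.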