Confusion with HUMAnN 'regroup_table' and higher-level pathway information

Hello! Thank you so much for your work and continued development of HUMAnN – it’s incredible! I have made it most of the way through the documentation/tutorial but have been running into some hiccups with subsequent steps/analysis. I think I may be misunderstanding some things (even in regards to which data to use, as I’m kind of new to metabolic pathway work), and am just hoping for some advice/clarification.

Briefly, I have ~270 human stool metagenomes from people of different health statuses (e.g. cases and controls) and am interested in comparing their metabolic potential. I know that HUMAnN outputs three different file types: gene families, pathway abundances, and pathway coverages. I think, for the purposes of doing a sort of broad, birds-eye level view of differences among our groups, we are most interested in the pathways files. BUT, I wasn’t sure if it’s a terrible error to disregard the gene families results, haha. So, the rest of this post dives into some nitty-gritty questions about that. I’m sorry, in advance, for the length:

I successfully ran through HUMAnN on our merged PE reads using UniRef90. Thus far, I was able to run ‘humann_renorm_table’ on each of my ‘genefamilies.tsv’ and ‘pathabundance.tsv’ to obtain CPM-normalized abundances for each sample. I then used ‘humann_join_tables’ to get three merged files (genefamilies, pathabundance, and pathcoverage) for all of my samples.

I then ran the ‘humann_regroup_table’ for the comprehensive ‘genefamilies’ file to get MetaCyc and/or Pfam groupings relevant to the UniRef output. Just to make sure I constructed this correctly, here is my code (for obtaining MetaCyc reactions):

humann_regroup_table --input $SCRATCH/humann/ERIN_humann_genefamilies_cpm_merged.tsv --groups uniref90_rxn --function sum --ungrouped Y --protected Y --output $SCRATCH/humann/ERIN_genefamilies_regrouped_MetaCyc.tsv;

The log for MetaCyc indicated this:

Original Feature Count: 2305037; Grouped 1+ times: 249669 (10.8%); Grouped 2+ times: 73840 (3.2%)

While the log for Pfam showed this:

Original Feature Count: 2305037; Grouped 1+ times: 1346840 (58.4%); Grouped 2+ times: 342782 (14.9%)

After viewing some other entries, it seems that these outcomes are pretty normal in terms of % grouped – is this correct?

At this point, I am honestly a little stumped on what to do next. For example, should I also be running ‘humann_regroup_table’ for the ‘pathabundance.tsv’ file as well to obtain similar annotations for the gene families and pathways? Or is this incorrect?

If it is recommended to incorporate the gene families into any sort of interpretation, I am very interested in classifying these gene families in accordance with the pathways in our samples. But is this an appropriate approach? I found relatively recent post (#2726) which was interested in finding which gene families contribute to pathways found in samples (is this right?). In this approach, I believe I’ve already successfully mapped the gene families to the MetaCyc reactions, but was confused on how to use the ‘metacyc_structured_pathways_filtered’ file in this context. Would I map the newly MetaCyc-annotated gene families file to this new file? Or filter my ‘pathabundance.tsv’ using this file somehow?

I had also found a couple of forum entries (namely #1830 and #1883) which mention assigning pathways to higher-level metabolic groups (which, if I’m understanding, doesn’t incorporate the gene families at all?). I realize some solutions were formed in these posts, but I felt a little lost when trying to apply those suggestions in my own pipeline.

For example, in Post 1830, the ’ map_metacyc-pwy_lineage.tsv was attached as a useful link between the MetaCyc pathways and higher-order groupings (thank you!). But I wasn’t sure how to apply this – I tried using this TSV file as a ‘–custom’ group in the ‘humann_regroup_table’ command with the ‘pathabundance.tsv’ file as input, but it only resulted in UNINTEGRATED or UNGROUPED fields collapsed by taxon; it didn’t conserve any of the pathway information. I’m guessing this was not the right way to go about it? Haha. Would it rather involve running some sort of ‘merge()’ function in Python or R externally to concatenate these assignments to the ‘pathabundance.tsv’ file for? (Sorry if this is a silly question, haha).

I apologize for the long post, especially if these questions are trivial. I think I just need some clarification on which of the output files may be most useful for my research goals as well as how to execute some of these previously suggested solutions. Thank you SO much for your time and help!