Confusion with HUMAnN 'regroup_table' and higher-level pathway information

Hello! Thank you so much for your work and continued development of HUMAnN – it’s incredible! I have made it most of the way through the documentation/tutorial but have been running into some hiccups with subsequent steps/analysis. I think I may be misunderstanding some things (even in regards to which data to use, as I’m kind of new to metabolic pathway work), and am just hoping for some advice/clarification.

Briefly, I have ~270 human stool metagenomes from people of different health statuses (e.g. cases and controls) and am interested in comparing their metabolic potential. I know that HUMAnN outputs three different file types: gene families, pathway abundances, and pathway coverages. I think, for the purposes of doing a sort of broad, birds-eye level view of differences among our groups, we are most interested in the pathways files. BUT, I wasn’t sure if it’s a terrible error to disregard the gene families results, haha. So, the rest of this post dives into some nitty-gritty questions about that. I’m sorry, in advance, for the length:

I successfully ran through HUMAnN on our merged PE reads using UniRef90. Thus far, I was able to run ‘humann_renorm_table’ on each of my ‘genefamilies.tsv’ and ‘pathabundance.tsv’ to obtain CPM-normalized abundances for each sample. I then used ‘humann_join_tables’ to get three merged files (genefamilies, pathabundance, and pathcoverage) for all of my samples.

I then ran the ‘humann_regroup_table’ for the comprehensive ‘genefamilies’ file to get MetaCyc and/or Pfam groupings relevant to the UniRef output. Just to make sure I constructed this correctly, here is my code (for obtaining MetaCyc reactions):

humann_regroup_table --input $SCRATCH/humann/ERIN_humann_genefamilies_cpm_merged.tsv --groups uniref90_rxn --function sum --ungrouped Y --protected Y --output $SCRATCH/humann/ERIN_genefamilies_regrouped_MetaCyc.tsv;

The log for MetaCyc indicated this:

Original Feature Count: 2305037; Grouped 1+ times: 249669 (10.8%); Grouped 2+ times: 73840 (3.2%)

While the log for Pfam showed this:

Original Feature Count: 2305037; Grouped 1+ times: 1346840 (58.4%); Grouped 2+ times: 342782 (14.9%)

After viewing some other entries, it seems that these outcomes are pretty normal in terms of % grouped – is this correct?

At this point, I am honestly a little stumped on what to do next. For example, should I also be running ‘humann_regroup_table’ for the ‘pathabundance.tsv’ file as well to obtain similar annotations for the gene families and pathways? Or is this incorrect?

If it is recommended to incorporate the gene families into any sort of interpretation, I am very interested in classifying these gene families in accordance with the pathways in our samples. But is this an appropriate approach? I found relatively recent post (#2726) which was interested in finding which gene families contribute to pathways found in samples (is this right?). In this approach, I believe I’ve already successfully mapped the gene families to the MetaCyc reactions, but was confused on how to use the ‘metacyc_structured_pathways_filtered’ file in this context. Would I map the newly MetaCyc-annotated gene families file to this new file? Or filter my ‘pathabundance.tsv’ using this file somehow?

I had also found a couple of forum entries (namely #1830 and #1883) which mention assigning pathways to higher-level metabolic groups (which, if I’m understanding, doesn’t incorporate the gene families at all?). I realize some solutions were formed in these posts, but I felt a little lost when trying to apply those suggestions in my own pipeline.

For example, in Post 1830, the ’ map_metacyc-pwy_lineage.tsv was attached as a useful link between the MetaCyc pathways and higher-order groupings (thank you!). But I wasn’t sure how to apply this – I tried using this TSV file as a ‘–custom’ group in the ‘humann_regroup_table’ command with the ‘pathabundance.tsv’ file as input, but it only resulted in UNINTEGRATED or UNGROUPED fields collapsed by taxon; it didn’t conserve any of the pathway information. I’m guessing this was not the right way to go about it? Haha. Would it rather involve running some sort of ‘merge()’ function in Python or R externally to concatenate these assignments to the ‘pathabundance.tsv’ file for? (Sorry if this is a silly question, haha).

I apologize for the long post, especially if these questions are trivial. I think I just need some clarification on which of the output files may be most useful for my research goals as well as how to execute some of these previously suggested solutions. Thank you SO much for your time and help!

2 Likes

Hello,

I find myself in a quite similar situation. I already conducted differential abundance analyses on the gene families at the PFAM level, and found thousands of significant difference with MaAslin. Very satisfying, although there are just too many associations to be represented easily.
Thus, I wanted to simplify the message, and read pretty much every post on Biobakery related to Humann2. My aim is quite simple : I’d like to simplify the message by regrouping the pathways of the humann_pathabundance.tsv file into higher levels.

I found, after some efforts, a mapping file that seemed appropriate to perform this. It is the mapping file provided by Chi, in his package MicroEco. The format should be correct for the purpose, as you can see in the joined file.
mapping.tsv (268.0 KB)

Then, I tried to regroup the PWY of my dataset into higher levels, using the mapping.tsv file. A subset of my data can be found here :
humann_pathabundance.tsv (769.9 KB)
Here is the formula and the verbose :

humann_regroup_table --input humann_pathabundance.tsv \
    --output grouped.tsv --ungrouped N --custom mapping.tsv

Loading table from: humann_pathabundance.tsv
  Treating humann_pathabundance.tsv as stratified output, e.g. ['UNINTEGRATED', 'g__Abiotrophia.s__Abiotrophia_defectiva']
Loading custom groups file: mapping.tsv
Loading mapping file from: mapping.tsv
Original Feature Count: 502; Grouped 1+ times: 2 (0.4%); Grouped 2+ times: 0 (0.0%)

Basically, every PWY was “UNINTEGRATED”, with a row for each species discovered.
I am a newcomer in bioinformatics, thus I can probably miss something obvious. I tried to formate the pathabundance table and remove every stratified level, and to remove the text contained between the “:” and the first tabulation of each row, but it didn’t work.

One may hope that you found a fix for this situation, Zoé ? Or anyone reading this ?

My thanks