Duplicate-like Reaction Entries with Identical CPM Values in HUMAnN Output

Hello,

I’m using HUMAnN for metagenomic analysis, and I’ve noticed that in the output files there are multiple entries for the same reaction with identical names and values.

I found a similar situation discussed in the post "Identical values of different RXN number (with the same gene family)", but that question does not appear to have been fully resolved.

What I’ve done so far:
I tested this using a smaller version of my dataset with two samples (test1 and test2).

humann --input ../fastq/test1.fastq.gz --output .
humann_regroup_table -i test1_genefamilies.tsv -g uniref90_rxn -o test1_rxn.tsv
humann_renorm_table --input test1_rxn.tsv --output test1_rxn_cpm.tsv --units cpm --special n --update-snames
humann_join_tables -i . --file_name _rxn_cpm.tsv -o all_rxn_cpm.tsv
humann_rename_table -i all_rxn_cpm.tsv -n metacyc-rxn -o all_rxn_cpm_renamed.tsv
humann_split_stratified_table -i all_rxn_cpm_renamed.tsv -o split

Here is an example from split/all_rxn_cpm_renamed_unstratified.tsv, showing multiple entries with the same reaction name and identical CPM values:

RXNQT-4165: (expasy) 3-isopropylmalate dehydrogenase [1.1.1.85]   1163.72   914.825  
RXNQT-4168: (expasy) 3-isopropylmalate dehydrogenase [1.1.1.85]   1163.72   914.825  
RXNQT-4171: (expasy) 3-isopropylmalate dehydrogenase [1.1.1.85]   1163.72   914.825  
RXNQT-4174: (expasy) 3-isopropylmalate dehydrogenase [1.1.1.85]   1163.72   914.825  
RXNQT-4178: (expasy) 3-isopropylmalate dehydrogenase [1.1.1.85]   1163.72   914.825  

My questions are:

  • Why do these duplicate-like entries occur? Could this be due to a single UniRef ID mapping to multiple reaction IDs (with the same name), or is there another reason?

  • How should I handle them in downstream analyses? Is it recommended to keep them as-is, or should I merge the duplicate-like entries and renormalize the data? I’m concerned that these entries might overemphasize the influence of this reaction in analyses like PCoA or differential abundance comparisons using the normalized values.

Thank you in advance for your support.

Best regards,
Matsumoto

It looks like MetaCyc has attached a single EC (1.1.1.85) to all of these reactions. Since HUMAnN quantifies gene families and then regroups them into ECs, all reactions that share an EC get the same abundance. If such cases are rare, I wouldn’t worry about them too much. If you do want to remove them, you could cluster the features by correlation and keep only one feature from each highly correlated group.
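As a rough illustration of the "keep one feature per group" idea, here is a minimal Python sketch. It is not a HUMAnN utility, and it uses a simpler stand-in for correlation-based clustering: features are grouped only when they share both the same description (the text after the reaction ID) and exactly the same values in every sample. The function name `collapse_duplicates` and the `HYPOTHETICAL-RXN` row are made up for this example; the five RXNQT rows are the ones quoted in the question. It assumes the unstratified table has already been read into (label, values) pairs.

```python
def collapse_duplicates(rows):
    """Collapse features that share a description and an identical abundance profile.

    rows: list of (feature_label, [per-sample values]) from an unstratified table.
    Returns a list of (kept_id, description, values, dropped_ids).
    """
    groups = {}  # plain dicts keep insertion order (Python 3.7+)
    for feature, values in rows:
        # "RXNQT-4165: (expasy) ..." -> ID before ": ", description after it.
        rxn_id, _, description = feature.partition(": ")
        # Labels without a description (e.g. UNMAPPED) are keyed on the label itself.
        description = description or feature
        key = (description, tuple(values))
        groups.setdefault(key, []).append(rxn_id)
    # Keep the first ID seen in each group and report the others as dropped.
    return [(ids[0], desc, list(vals), ids[1:]) for (desc, vals), ids in groups.items()]


name = "(expasy) 3-isopropylmalate dehydrogenase [1.1.1.85]"
rows = [(f"RXNQT-{n}: {name}", [1163.72, 914.825])
        for n in (4165, 4168, 4171, 4174, 4178)]
# Made-up row, included only to show that distinct features are left alone.
rows.append(("HYPOTHETICAL-RXN: made-up reaction for illustration", [10.0, 20.0]))

collapsed = collapse_duplicates(rows)
for kept_id, description, values, dropped in collapsed:
    print(kept_id, description, values, "dropped:", dropped)
```

Exact matching is deliberately conservative: with only two samples, any two features are perfectly correlated, so a correlation threshold would merge far too much. With many samples, a correlation-based grouping as suggested above becomes more meaningful.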

Thank you for your response! I will consider removing the duplicates.

Currently, I process the *_genefamilies.tsv from HUMAnN in the following order:

humann_renorm_table → humann_join_tables → humann_rename_table → humann_split_stratified_table.

Would it be appropriate to remove the duplicate rows after the last of these steps and then run humann_renorm_table again?

Are you working with HUMAnN v3 or v4? If it’s v4, you don’t need to do any extra normalization. If it’s v3, you can normalize genes for sequencing depth and then regroup to reactions or ECs. In either case, you can simply drop the duplicate features, since the depth normalization was already done once at the gene level. If you want the reactions themselves to form a composition (i.e. add up to 100%), you would need to (re)normalize after dropping the duplicate rows.
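To make the last point concrete, here is a small Python sketch of renormalizing after duplicates have been dropped, so that each sample's reactions again sum to a fixed total (1,000,000 for CPM; use 100 for percent). This is only an illustration of the arithmetic, not HUMAnN code: the function name `renormalize` and the `HYPOTHETICAL-REST` row are invented for the example, and it assumes an unstratified table held as a dict of per-sample values. Rerunning the `humann_renorm_table` command from the question on the deduplicated table would be the command-line route.

```python
def renormalize(table, total=1_000_000):
    """Rescale every sample (column) so its features sum to `total`.

    table: dict mapping feature ID -> list of per-sample abundances.
    Columns that sum to zero are left as zeros to avoid dividing by zero.
    """
    n_samples = len(next(iter(table.values())))
    column_sums = [sum(values[j] for values in table.values())
                   for j in range(n_samples)]
    return {
        feature: [v * total / s if s > 0 else 0.0
                  for v, s in zip(values, column_sums)]
        for feature, values in table.items()
    }


deduplicated = {
    # Values quoted in the question (one representative of the five rows).
    "RXNQT-4165": [1163.72, 914.825],
    # Made-up row standing in for all remaining reactions.
    "HYPOTHETICAL-REST": [500000.0, 400000.0],
}

renormed = renormalize(deduplicated)
for feature, values in renormed.items():
    print(feature, values)
```

Whether this step is wanted depends on the analysis: it only matters if the reaction table itself should behave as a composition, as described above.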

Thank you for the clarification! I appreciate your help and will consider how to proceed.