I’m using HUMAnN for metagenomic analysis, and I’ve noticed that the reaction-level output contains multiple entries with different reaction IDs but identical names and identical values.
I found a similar situation discussed in this post: "Identical values of different RXN number (with the same gene family)", but that question does not seem to have been fully resolved.
What I’ve done so far:
I tested this on a reduced version of my dataset with two samples (test1 and test2). The per-sample commands are shown for test1 only:
humann --input ../fastq/test1.fastq.gz --output .
humann_regroup_table -i test1_genefamilies.tsv -g uniref90_rxn -o test1_rxn.tsv
humann_renorm_table --input test1_rxn.tsv --output test1_rxn_cpm.tsv --units cpm --special n --update-snames
humann_join_tables -i . --file_name _rxn_cpm.tsv -o all_rxn_cpm.tsv
humann_rename_table -i all_rxn_cpm.tsv -n metacyc-rxn -o all_rxn_cpm_renamed.tsv
humann_split_stratified_table -i all_rxn_cpm_renamed.tsv -o split
Here is an example from split/all_rxn_cpm_renamed_unstratified.tsv, showing multiple entries with the same reaction name and identical CPM values:
RXNQT-4165: (expasy) 3-isopropylmalate dehydrogenase [] 1163.72 914.825
RXNQT-4168: (expasy) 3-isopropylmalate dehydrogenase [] 1163.72 914.825
RXNQT-4171: (expasy) 3-isopropylmalate dehydrogenase [] 1163.72 914.825
RXNQT-4174: (expasy) 3-isopropylmalate dehydrogenase [] 1163.72 914.825
RXNQT-4178: (expasy) 3-isopropylmalate dehydrogenase [] 1163.72 914.825
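To see how widespread this pattern is, a minimal sketch along the following lines could list every group of rows that shares a name and all sample values while differing only in reaction ID. It assumes a tab-separated table with one header line and features written as "ID: name" in the first column; the function name, the header text, and the RXN-HYPOTHETICAL row are made up for illustration, and the other demo rows are copied from the example above.

```shell
#!/usr/bin/env bash
# Report groups of rows with the same name and the same values in every
# sample column. Output columns: group size, comma-joined IDs, name, values.
# Assumption: tab-separated input, one header line, features as "ID: name".
find_duplicate_like_rows() {
    awk -F'\t' '
        NR == 1 { next }                       # skip the header line
        {
            pos = index($1, ": ")
            if (pos == 0) next                 # rows without a name are ignored
            id  = substr($1, 1, pos - 1)
            key = substr($1, pos + 2)          # name, then every sample value
            for (i = 2; i <= NF; i++) key = key "\t" $i
            count[key]++
            if (count[key] == 1) { order[++n] = key; ids[key] = id }
            else ids[key] = ids[key] "," id
        }
        END {
            for (j = 1; j <= n; j++) {
                k = order[j]
                if (count[k] > 1) printf "%d\t%s\t%s\n", count[k], ids[k], k
            }
        }
    ' "$1"
}

# Demo on a small stand-in table (header text and last row are hypothetical).
demo=$(mktemp)
printf '%s\t%s\t%s\n' \
    '# Gene Family' 'test1' 'test2' \
    'RXNQT-4165: (expasy) 3-isopropylmalate dehydrogenase []' '1163.72' '914.825' \
    'RXNQT-4168: (expasy) 3-isopropylmalate dehydrogenase []' '1163.72' '914.825' \
    'RXNQT-4171: (expasy) 3-isopropylmalate dehydrogenase []' '1163.72' '914.825' \
    'RXN-HYPOTHETICAL: made-up reaction' '10' '20' > "$demo"

dup_report=$(find_duplicate_like_rows "$demo")
printf '%s\n' "$dup_report"
rm -f "$demo"
```

On the real data this would be called as find_duplicate_like_rows split/all_rxn_cpm_renamed_unstratified.tsv.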
My questions are:
1. Why do these duplicate-like entries occur? Could this be due to a single UniRef ID mapping to multiple reaction IDs (with the same name), or is there another reason?
2. How should I handle them in downstream analyses? Is it recommended to keep them as-is, or should I merge the duplicate-like entries and renormalize the data? I’m concerned that these entries might overemphasize the influence of this reaction in analyses like PCoA or differential abundance comparisons using the normalized values.
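For reference, the merge option could be sketched roughly as follows. This is only one possible reading of "merge": it keeps a single copy of the shared values instead of summing them, and it joins the reaction IDs with commas (not "|", which separates features from taxa in stratified tables). The function name, the header text, and the RXN-HYPOTHETICAL row are made up for illustration, and the sketch is meant for the unstratified table only.

```shell
#!/usr/bin/env bash
# Collapse rows that share a name and all sample values into a single row
# labelled "ID1,ID2,...: name". Values are kept once, not summed.
# Assumption: tab-separated input, one header line, features as "ID: name".
collapse_duplicate_like_rows() {
    awk -F'\t' '
        NR == 1 { print; next }                # pass the header through
        {
            pos = index($1, ": ")
            if (pos == 0) { key = $0; id = $1; name = "" }   # never merged
            else { id = substr($1, 1, pos - 1); name = substr($1, pos + 2) }
            vals = ""
            for (i = 2; i <= NF; i++) vals = vals "\t" $i
            if (pos > 0) key = name vals
            if (!(key in ids)) {
                order[++n] = key
                ids[key] = id; label[key] = name; rowvals[key] = vals
            } else {
                ids[key] = ids[key] "," id
            }
        }
        END {
            for (j = 1; j <= n; j++) {
                k = order[j]
                feature = (label[k] == "") ? ids[k] : ids[k] ": " label[k]
                print feature rowvals[k]
            }
        }
    ' "$1"
}

# Demo on a small stand-in table (header text and last row are hypothetical).
demo=$(mktemp)
printf '%s\t%s\t%s\n' \
    '# Gene Family' 'test1' 'test2' \
    'RXNQT-4165: (expasy) 3-isopropylmalate dehydrogenase []' '1163.72' '914.825' \
    'RXNQT-4168: (expasy) 3-isopropylmalate dehydrogenase []' '1163.72' '914.825' \
    'RXNQT-4171: (expasy) 3-isopropylmalate dehydrogenase []' '1163.72' '914.825' \
    'RXN-HYPOTHETICAL: made-up reaction' '10' '20' > "$demo"

collapsed=$(collapse_duplicate_like_rows "$demo")
printf '%s\n' "$collapsed"
rm -f "$demo"
```

If renormalizing after the merge is appropriate, the collapsed table could presumably be passed through humann_renorm_table again with the same options as in the pipeline above.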
Thank you in advance for your support.
Best regards,