Only ~3% of UniRef90 gene families are regrouped to KO in HUMAnN3 — advice appreciated

Hello,

I am currently analyzing metagenomic data using HUMAnN3. After generating and merging the gene families table, I attempted to regroup UniRef90 gene families to KEGG Orthologs (KOs) with the following command:

humann_regroup_table --input genefamilies.tsv --output genefamilies_ko.tsv --groups uniref90_ko

The output included the message:

Original Feature Count: 503760; Grouped 1+ times: 16702 (3.3%); Grouped 2+ times: 78 (0.0%)

indicating that only about 3.3% of the UniRef90 features were assigned to KOs.

I am using the UniRef90 database version uniref90_201901b_full.dmnd, which I believe is up-to-date.
The input gene families table appears to be correctly generated and contains tens of thousands of UniRef90 IDs.
No errors or warnings occurred during the regrouping or normalization steps.

However, this KO assignment rate (~3.3%) seems unexpectedly low compared to literature reports and other analyses, where KO mapping rates often range between 50% and 80%.

Could you please advise on:

  1. Common reasons or factors that might cause such a low KO regrouping rate?
  2. Recommended checks or steps to troubleshoot or improve the KO assignment?
  3. Whether the sample type or environment could significantly affect the KO mapping rate?

Any insights or suggestions would be greatly appreciated.

Thank you very much for your support!



KO mapping rates in HUMAnN are unfortunately limited by what we can source from UniProt, which I believe only includes KO annotations for reference proteomes from KEGG. You can get a higher annotation rate by running KOfam directly on your proteins (for example), which might be what the literature reports are referencing, though 50-80% still feels high to me.

Something we’ve used internally to improve this (which works very well) is to allow KO annotations to be shared within UniRef50 clusters. For example, if a UniRef90 (A) is annotated to KO X, find A’s corresponding UniRef50 (B) and then assign all UniRef90s in B to KO X. You can do this by combining the UniRef90_KO and UniRef90_UniRef50 maps provided with HUMAnN. It works because KO annotations within UniRef50s are extremely homogeneous (which you will see upon inspecting the mapping files).
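
A rough sketch of that propagation in Python, in case it helps. This is not the script we use internally, just an illustration under a few assumptions: that both mapping files are tab-separated with the group ID in the first column and its UniRef90 members in the remaining columns, and that they are named something like map_ko_uniref90.txt.gz and map_uniref50_uniref90.txt.gz in your utility mapping folder (please check your own install). The function names and the output file name are made up for this example. If a UniRef50 cluster happens to contain more than one KO, this sketch assigns all of them to every member.

```python
import gzip
from collections import defaultdict


def read_groups(lines):
    """Parse 'group<TAB>member1<TAB>member2...' lines into {group: set(members)}.

    Assumed layout of the HUMAnN mapping files; verify against your copies.
    """
    groups = defaultdict(set)
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 2:
            continue  # skip blank or member-less lines
        groups[fields[0]].update(f for f in fields[1:] if f)
    return groups


def propagate_ko(ko_to_u90, u50_to_u90):
    """Share each KO annotation with all UniRef90s in the same UniRef50 cluster."""
    # Invert the cluster map so each UniRef90 points at its UniRef50.
    u90_to_u50 = {}
    for u50, members in u50_to_u90.items():
        for u90 in members:
            u90_to_u50[u90] = u50

    expanded = defaultdict(set)
    for ko, members in ko_to_u90.items():
        for u90 in members:
            expanded[ko].add(u90)  # keep the original annotation
            u50 = u90_to_u50.get(u90)
            if u50 is not None:
                # Every sibling in the UniRef50 cluster inherits the KO.
                expanded[ko].update(u50_to_u90[u50])
    return expanded


def write_groups(groups, path):
    """Write the expanded map back out in the same group-first layout."""
    with open(path, "w") as out:
        for group in sorted(groups):
            out.write("\t".join([group] + sorted(groups[group])) + "\n")


def build_expanded_map(ko_path, u50_path, out_path):
    """Hypothetical driver: read both gzipped maps and write the expanded one."""
    with gzip.open(ko_path, "rt") as handle:
        ko_to_u90 = read_groups(handle)
    with gzip.open(u50_path, "rt") as handle:
        u50_to_u90 = read_groups(handle)
    write_groups(propagate_ko(ko_to_u90, u50_to_u90), out_path)


# Example call (paths are assumptions, adjust to your setup):
# build_expanded_map("map_ko_uniref90.txt.gz",
#                    "map_uniref50_uniref90.txt.gz",
#                    "map_ko_uniref90_expanded.tsv")
```

The resulting file should then be usable as a custom grouping, which I believe is passed to humann_regroup_table with the --custom option in place of --groups; check humann_regroup_table --help to confirm on your version. The UniRef50 map is large, so expect this to need a fair amount of memory.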