humann2_regroup_table for KEGG: UNGROUPED!

Hi, I’m using humann2_regroup_table to convert UniRef90 into KO with the mapping file “map_ko_uniref90.txt.gz”. My command is here:

humann2_regroup_table -i input_genefamily.tsv -c map_ko_uniref90.txt.gz -o output.tsv

But I found that most of the reads are UNGROUPED in the output file.

I then found that 317015 out of 352836 UniRef90 IDs in input_genefamily.tsv have no record in the mapping file “map_ko_uniref90.txt.gz”.
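
For anyone who wants to repeat this check, here is a rough Python sketch of how the count could be done. It is not part of HUMAnN2. It assumes the mapping file has one group per line (KO first, then its UniRef90 members, tab-separated) and that the gene family table may contain stratified rows such as "UniRef90_X|g__Genus.s__species", which are collapsed to the bare ID. The function names are made up for illustration.

```python
import gzip  # only needed when reading the real .gz mapping file


def load_mapped_ids(lines):
    """Collect member IDs from lines shaped like: group<TAB>member1<TAB>member2...

    Assumption: the first column is the group (e.g. a KO) and every
    following column is a UniRef90 ID that belongs to it.
    """
    mapped = set()
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        mapped.update(f for f in fields[1:] if f)
    return mapped


def load_feature_ids(lines):
    """Collect unique feature IDs from a gene family table.

    Skips the header/comment lines and the UNMAPPED row, and strips any
    "|taxon" suffix so stratified rows count once per gene family.
    """
    ids = set()
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue
        feature = line.split("\t", 1)[0].split("|", 1)[0]
        if feature != "UNMAPPED":
            ids.add(feature)
    return ids


def coverage(feature_ids, mapped_ids):
    """Return (number of features with a mapping record, total features)."""
    return len(feature_ids & mapped_ids), len(feature_ids)


# Tiny made-up example; the IDs below are placeholders, not real entries.
example_map = ["K00001\tUniRef90_A\tUniRef90_B\n", "K00002\tUniRef90_B\n"]
example_table = [
    "# Gene Family\tsample_Abundance-RPKs\n",
    "UNMAPPED\t100.0\n",
    "UniRef90_A\t10.0\n",
    "UniRef90_A|g__Example.s__Example_sp\t10.0\n",
    "UniRef90_C\t5.0\n",
]
hits, total = coverage(load_feature_ids(example_table), load_mapped_ids(example_map))
```

On real files the same functions could be fed with `gzip.open("map_ko_uniref90.txt.gz", "rt")` and `open("input_genefamily.tsv")`.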

I also tried building a custom protein reference database from KEGG and computing the KO abundance manually with MinPath. When I compared the new output with the former, they showed no consistency or statistical correlation.

Is it reasonable to get the KO abundance from the original gene family table using humann2_regroup_table?
Do you have any suggestions on how to get the KO abundance of a metagenome sample?

Thanks for your reply!

This is a reasonable approach. KO annotations are relatively rare among UniRef90s (perhaps 10%?), which is why you’re seeing so many UniRef90s not regrouped to KOs. The same is true of ECs. If you want to regroup to something broader than a UniRef90, you can use UniRef50s or (my preference) Pfam domains. Both UniRef50s and Pfams have strong coverage of UniRef90 but Pfams are less numerous and better annotated.
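
The numbers in the question fit that estimate: 352836 - 317015 = 35821 UniRef90s had a KO record, which is about 10%. To make it concrete why the rest lands in UNGROUPED, here is a simplified Python sketch of sum-based regrouping. It is an illustration only, not HUMAnN2's actual code, and it assumes that a feature belonging to several groups contributes its full abundance to each of them. The function name and the example IDs are hypothetical.

```python
from collections import defaultdict


def regroup_by_sum(abundances, member_to_groups):
    """Sum feature abundances into groups.

    abundances:       dict mapping feature ID -> abundance
    member_to_groups: dict mapping feature ID -> list of group IDs
    Features with no group are pooled under "UNGROUPED".
    UNMAPPED is passed through unchanged (assumed behaviour).
    """
    out = defaultdict(float)
    for feature, value in abundances.items():
        if feature == "UNMAPPED":
            out["UNMAPPED"] += value
            continue
        groups = member_to_groups.get(feature)
        if not groups:
            out["UNGROUPED"] += value
            continue
        for group in groups:
            out[group] += value
    return dict(out)


# Made-up example: only one of three gene families has a KO annotation.
example_abundances = {
    "UNMAPPED": 100.0,
    "UniRef90_A": 10.0,
    "UniRef90_B": 5.0,
    "UniRef90_C": 2.5,
}
example_mapping = {"UniRef90_A": ["K00001"]}
regrouped = regroup_by_sum(example_abundances, example_mapping)
```

With a mapping that covers only about a tenth of the input IDs, most of the summed abundance ends up in the UNGROUPED row unless the annotated families happen to be the most abundant ones.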


It seems regrouping UniRef90s to KOs is not a good idea. Thanks for your answer anyway!

Regrouping tends to introduce a trade-off: you end up with a smaller number of better understood features (e.g. KOs) but you lose the resolution and coverage provided by the original UniRef90 units. I do tend to analyze some sort of regrouped features by default unless I’m specifically interested in pointing out individual genes for follow-up analysis. For example, if you want to knock out a potentially interesting gene from a bug of interest, then analysis at the UniRef90 level would make more sense.