But I found that most of the reads are UNGROUPED in the output file.
I then found that 317015 of the 352836 uniref90_ids in input_genefamily.tsv have no record in the mapping file “map_ko_uniref90.txt.gz”.
I also tried building a custom KEGG protein reference database and getting the KO abundance output manually with MinPath. When I compared the new output with the former, they showed no consistency or statistical correlation.
Is it reasonable to get the KO abundance from the original gene family table using humann2_regroup_table?
Do you have any suggestions on how to get the KO abundance of a metagenome sample?
This is a reasonable approach. KO annotations are relatively rare among UniRef90s (perhaps 10%?), which is why you’re seeing so many UniRef90s not regrouped to KOs. That matches your numbers: 35821 of your 352836 UniRef90 IDs (about 10%) do have a record in the mapping file. The same is true of ECs. If you want to regroup to something broader than a UniRef90, you can use UniRef50s or (my preference) Pfam domains. Both UniRef50s and Pfams have strong coverage of UniRef90s, but Pfams are less numerous and better annotated.
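If you want to check coverage for a given mapping before regrouping, something like the following could work. It is only a rough sketch, not part of HUMAnN2. It assumes the mapping file is tab-delimited with the group ID (e.g. a KO) in the first column and its member UniRef90 IDs in the remaining columns, and that the gene family table has the feature ID in its first column, with stratified rows written as `ID|taxon`. The function names are made up for illustration.

```python
def load_mapped_features(lines):
    """Collect every feature ID that belongs to at least one group.

    Assumed layout per line: group_id <TAB> member_1 <TAB> member_2 ...
    """
    mapped = set()
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        mapped.update(f for f in fields[1:] if f)
    return mapped


def read_feature_ids(lines):
    """Collect the unique feature IDs from a gene family table.

    Header/comment lines (starting with '#') are skipped, and stratified
    rows such as 'UniRef90_X|g__Genus.s__species' are reduced to 'UniRef90_X'.
    """
    ids = set()
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue
        feature = line.split("\t", 1)[0].strip()
        ids.add(feature.split("|", 1)[0])
    return ids


def mapping_coverage(feature_ids, mapped):
    """Fraction of feature IDs that appear in the mapping."""
    if not feature_ids:
        return 0.0
    return len(feature_ids & mapped) / len(feature_ids)


# Toy data standing in for the real files (IDs are invented).
toy_mapping = ["K00001\tUniRef90_A\tUniRef90_B\n"]
toy_table = [
    "# Gene Family\tsample1\n",
    "UniRef90_A\t3.0\n",
    "UniRef90_A|g__Example.s__Example_sp\t3.0\n",
    "UniRef90_C\t5.0\n",
]

toy_mapped = load_mapped_features(toy_mapping)
toy_ids = read_feature_ids(toy_table)
print(f"coverage: {mapping_coverage(toy_ids, toy_mapped):.1%}")
```

For the real files you would pass open file handles instead of the toy lists, e.g. `gzip.open("map_ko_uniref90.txt.gz", "rt")` for the mapping and `open("input_genefamily.tsv")` for the table.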
Regrouping tends to introduce a trade-off: you end up with a smaller number of better understood features (e.g. KOs) but you lose the resolution and coverage provided by the original UniRef90 units. I do tend to analyze some sort of regrouped features by default unless I’m specifically interested in pointing out individual genes for follow-up analysis. For example, if you want to knock out a potentially interesting gene from a bug of interest, then analysis at the UniRef90 level would make more sense.
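To make the trade-off concrete, here is a simplified sketch of what regrouping by summation does. It is an illustration only, not the actual humann2_regroup_table code: the function name, the toy IDs, and the choice to pool unmapped features under an "UNGROUPED" label are assumptions for the example.

```python
from collections import defaultdict


def regroup_by_sum(abundances, feature_to_groups):
    """Sum per-feature abundances into their groups.

    Features with no group are pooled under 'UNGROUPED'. A feature that
    belongs to several groups contributes its full value to each of them.
    """
    regrouped = defaultdict(float)
    for feature, value in abundances.items():
        groups = feature_to_groups.get(feature)
        if not groups:
            regrouped["UNGROUPED"] += value
            continue
        for group in groups:
            regrouped[group] += value
    return dict(regrouped)


# Toy input: three UniRef90-level features, only two of which have a KO.
toy_abundances = {"UniRef90_A": 3.0, "UniRef90_B": 2.0, "UniRef90_C": 5.0}
toy_feature_to_ko = {"UniRef90_A": ["K00001"], "UniRef90_B": ["K00001"]}

toy_regrouped = regroup_by_sum(toy_abundances, toy_feature_to_ko)
print(toy_regrouped)
```

In the toy result, UniRef90_A and UniRef90_B collapse into a single KO, so you can no longer tell them apart, and UniRef90_C only survives as part of the UNGROUPED total. That is the loss of resolution and coverage described above.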