Hi all,
I’m trying to screen some metagenomics samples for nitrite reduction / NO-forming activity (see: KEGG ENZYME: 1.7.2.1) using HUMAnN 3.6. If I check the samples for the two KOs listed on that web page, there’s very little evidence of NO-forming genes. K00368 seems present in only a small % of samples:
$ grep "K00368" uniref90_ko.cpm.unstratified.tsv | cut -f1-30
K00368 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1.35695 0 0 0 0
And K15864 appears to be entirely absent:
$ grep "K15864" uniref90_ko.cpm.unstratified.tsv
$
However if I look at the data by the E.C. entry (1.7.2.1), it appears to be both fairly abundant AND prevalent in the data:
$ grep "1\.7\.2\.1" uniref90_level4ec.cpm.unstratified.tsv | cut -f1-30
1.7.2.1 68.7 163.506 38.3125 160.131 99.7002 189.444 72.9189 79.5386 69.1712 0 143.998 0 94.3286 0 0 0 46.0324 96.4395 0 118.337 47.4957 8.24756 47.4995 49.5644 78.9407 103.753 99.8318 113.478 0
This would lead me to the opposite conclusion from the KO tables. Is there a reason why these results are so different? I would understand there being some minor differences, but it seems to me that the different mappings should provide at least some level of consistency.
We’ve seen recently that the KO annotations in UniRef90 are actually quite sparse, and that might be part of what’s going on here. My suspicion is that UniProt is only importing KO annotations from KEGG reference genomes, rather than doing de novo KO prediction, hence you have to get somewhat lucky that the proteins you detect had KO annotations directly imported from KEGG. (Notably, as another user recently pointed out, UniProt appears to have stopped providing any KO annotations in recent releases, opting instead to point to KEGG reference sequences directly.)
Something we’ve been experimenting with internally is allowing KO annotations to be shared among UniRef90s that belong to the same UniRef50, which appears to boost KO annotation coverage quite a bit without loss of specificity. Here is an expanded UniRef90-to-KO map based on this logic that is compatible with the humann_regroup_table script. I’ll be curious to know if using this file instead of the default one improves agreement with your EC-focused result.
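For anyone wanting to picture the sharing logic, here is a minimal Python sketch of one way it could work. This is an illustration only, not the code used to build the file above: the function names, the toy UniRef IDs, and the choice to pool every KO seen in a UniRef50 cluster are all assumptions, and the output layout (one KO followed by its member UniRef90s) is likewise assumed rather than taken from the file.

```python
from collections import defaultdict


def expand_ko_annotations(ur90_to_ur50, ur90_to_kos):
    """Share KO annotations among UniRef90s in the same UniRef50 (hypothetical helper)."""
    # Pool every KO observed on any member of each UniRef50 cluster.
    ur50_to_kos = defaultdict(set)
    for ur90, kos in ur90_to_kos.items():
        ur50 = ur90_to_ur50.get(ur90)
        if ur50 is not None:
            ur50_to_kos[ur50].update(kos)

    # Each UniRef90 keeps its own KOs and gains the pooled KOs of its cluster.
    expanded = {}
    for ur90, ur50 in ur90_to_ur50.items():
        kos = set(ur90_to_kos.get(ur90, ())) | ur50_to_kos.get(ur50, set())
        if kos:
            expanded[ur90] = kos

    # Keep annotated UniRef90s that have no UniRef50 entry in the lookup.
    for ur90, kos in ur90_to_kos.items():
        expanded.setdefault(ur90, set(kos))
    return expanded


def invert_to_groups(ur90_to_kos):
    """Turn UniRef90 -> KOs into KO -> sorted list of member UniRef90s."""
    groups = defaultdict(list)
    for ur90 in sorted(ur90_to_kos):
        for ko in sorted(ur90_to_kos[ur90]):
            groups[ko].append(ur90)
    return dict(groups)


# Toy data (made-up IDs): A and B share a UniRef50, only A carries a KO.
ur90_to_ur50 = {
    "UniRef90_A": "UniRef50_X",
    "UniRef90_B": "UniRef50_X",
    "UniRef90_C": "UniRef50_Y",
}
ur90_to_kos = {"UniRef90_A": {"K00368"}}

expanded = expand_ko_annotations(ur90_to_ur50, ur90_to_kos)
groups = invert_to_groups(expanded)
```

In this toy case UniRef90_B inherits K00368 from UniRef90_A, while UniRef90_C stays unannotated because nothing in its UniRef50 has a KO. Pooling like this trades some risk of over-annotation for coverage, so it is worth spot-checking a few families you know well.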
Thanks for providing that file, the results do seem to be in better agreement:
$ grep "K00368" uniref90_ko_NEW.cpm.unstratified.tsv | cut -f1-30
K00368 67.9994 110.217 42.1048 165.697 96.1491 85.9958 32.9454 67.5835 58.2956 0 0 0 16.5542 0 0 0 47.7426 36.4944 0 66.8156 36.0168 36.9052 49.0502 17.9201 62.3122 34.2248 99.2732 61.0391 0
I didn’t realize that the KO annotations were imported directly from UniRef rather than predicted de novo – how feasible would de novo prediction be? While there is a huge improvement with the new mapping file, I still would expect to see stronger agreement than this:
There’s a clear correlation, but there are a fair number of samples where either the KO or the EC annotation is missing, or where the point sits pretty far away from the diagonal.
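To put rough numbers on that, a small Python sketch like the following could label each sample by how the two rows compare. It assumes both tables list samples in the same column order, and the 2-fold cutoff for "far from the diagonal" is an arbitrary choice for illustration; `classify_samples` is a hypothetical helper, not part of HUMAnN. The example values are the 9th through 13th sample columns of the K00368 (new mapping) and 1.7.2.1 rows quoted above.

```python
def classify_samples(ko_values, ec_values, max_fold=2.0):
    """Label each sample by KO vs. EC agreement (hypothetical helper).

    max_fold is an assumed cutoff: samples where both features are present
    but differ by more than this fold-change are called "divergent".
    """
    if len(ko_values) != len(ec_values):
        raise ValueError("rows must cover the same samples in the same order")
    labels = []
    for ko, ec in zip(ko_values, ec_values):
        if ko == 0 and ec == 0:
            labels.append("both_absent")
        elif ko == 0:
            labels.append("ec_only")
        elif ec == 0:
            labels.append("ko_only")
        elif max(ko, ec) / min(ko, ec) > max_fold:
            labels.append("divergent")
        else:
            labels.append("concordant")
    return labels


# Sample columns 9-13 from the rows quoted in this thread.
ko_row = [58.2956, 0, 0, 0, 16.5542]      # K00368, expanded mapping
ec_row = [69.1712, 0, 143.998, 0, 94.3286]  # 1.7.2.1

labels = classify_samples(ko_row, ec_row)
# labels == ["concordant", "both_absent", "ec_only", "both_absent", "divergent"]
```

Run over the full rows, tallying these labels would show how many samples are EC-only or KO-only versus merely off the diagonal.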
Glad that helped (even if it wasn’t a perfect solution). Running the HMM-based KO predictions on all of UniRef90 is a pretty big compute. That said, we’re working on something like that for HUMAnN 4 since we have a lot of new proteins in that release that needed annotations.