Big differences in results depending on type of mapping used?

Hi all,

I’m trying to screen some metagenomics samples for nitrite reduction / NO-forming activity (see KEGG ENZYME: 1.7.2.1) using HUMAnN 3.6. If I check the samples for the two KOs listed on that web page, there’s very little evidence of NO-forming genes. K00368 seems to be present in only a small percentage of samples:

$ grep "K00368" uniref90_ko.cpm.unstratified.tsv | cut -f1-30
K00368	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	1.35695	0	0	0	0

And K15864 appears to be entirely absent:

$ grep "K15864" uniref90_ko.cpm.unstratified.tsv
$

However if I look at the data by the E.C. entry (1.7.2.1), it appears to be both fairly abundant AND prevalent in the data:

$ grep "1\.7\.2\.1" uniref90_level4ec.cpm.unstratified.tsv | cut -f1-30
1.7.2.1	68.7	163.506	38.3125	160.131	99.7002	189.444	72.9189	79.5386	69.1712	0	143.998	0	94.3286	0	0	0	46.0324	96.4395	0	118.337	47.4957	8.24756	47.4995	49.5644	78.9407	103.753	99.8318	113.478	0

This would lead me to the opposite conclusion from the KO tables. Is there a reason why these results are so different? I would understand there being some minor differences, but it seems to me that the different mappings should provide at least some level of consistency.
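To put a number on "prevalent", here is a minimal Python sketch that counts the samples with nonzero abundance in a single table row. The `prevalence` helper is a hypothetical name of my own, not part of HUMAnN, and the row is the EC 1.7.2.1 line pasted from the output above (only the 29 samples shown by `cut -f1-30`).

```python
def prevalence(row):
    """Return (samples with nonzero abundance, total samples) for one table row."""
    fields = row.split()                      # feature ID, then per-sample values
    values = [float(v) for v in fields[1:]]   # drop the feature ID
    detected = sum(1 for v in values if v > 0)
    return detected, len(values)


# Row copied from the grep output above (whitespace-separated here).
ec_row = (
    "1.7.2.1 68.7 163.506 38.3125 160.131 99.7002 189.444 72.9189 79.5386 "
    "69.1712 0 143.998 0 94.3286 0 0 0 46.0324 96.4395 0 118.337 47.4957 "
    "8.24756 47.4995 49.5644 78.9407 103.753 99.8318 113.478 0"
)

detected, total = prevalence(ec_row)
print(f"{detected}/{total} samples")  # 22/29 samples
```

Applying the same function to the K00368 row gives a single detected sample, which is the gap I’m asking about.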

We’ve seen recently that the KO annotations in UniRef90 are actually quite sparse, and that might be part of what’s going on here. My suspicion is that UniProt only imports KO annotations from KEGG reference genomes rather than doing de novo KO prediction, so you have to get somewhat lucky that the proteins you detect had KO annotations imported directly from KEGG. (Notably, as another user recently pointed out, UniProt appears to have stopped providing any KO annotations in recent releases, opting instead to point to KEGG reference sequences directly.)

Something we’ve been experimenting with internally is allowing KO annotations to be shared among UniRef90s that belong to the same UniRef50, which appears to boost KO annotation coverage quite a bit without loss of specificity. Here is an expanded UniRef90-to-KO map based on this logic that is compatible with the humann_regroup_table script. I’ll be curious to know if using this file instead of the default one improves agreement with your EC-focused result.
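For anyone curious about the sharing logic, here is a rough Python sketch of the idea described above. It is an illustration under stated assumptions, not the code actually used to build the file: the function names, the toy identifiers (`UniRef90_A`, `UniRef50_X`, and so on), and the in-memory dictionaries are all hypothetical. The output helper assumes the usual regrouping layout of one group per line followed by its tab-separated member features.

```python
from collections import defaultdict


def expand_ko_map(ko_to_uniref90, uniref90_to_uniref50):
    """Share KO annotations among UniRef90s that fall in the same UniRef50."""
    # Step 1: collect every KO seen on any annotated member of each UniRef50.
    kos_by_uniref50 = defaultdict(set)
    for ko, members in ko_to_uniref90.items():
        for u90 in members:
            u50 = uniref90_to_uniref50.get(u90)
            if u50 is not None:
                kos_by_uniref50[u50].add(ko)

    # Step 2: keep the original annotations, then give each UniRef90 the KOs
    # collected for its UniRef50 cluster.
    expanded = {ko: set(members) for ko, members in ko_to_uniref90.items()}
    for u90, u50 in uniref90_to_uniref50.items():
        for ko in kos_by_uniref50.get(u50, ()):
            expanded[ko].add(u90)
    return expanded


def to_mapping_lines(ko_map):
    """Format as 'KO<TAB>member<TAB>member...' lines (assumed layout)."""
    return ["\t".join([ko] + sorted(members)) for ko, members in sorted(ko_map.items())]


# Toy input (hypothetical IDs): only UniRef90_A carries a KO annotation.
ko_to_uniref90 = {"K00368": {"UniRef90_A"}}
uniref90_to_uniref50 = {
    "UniRef90_A": "UniRef50_X",
    "UniRef90_B": "UniRef50_X",  # same UniRef50 as A, so it inherits K00368
    "UniRef90_C": "UniRef50_Y",  # different cluster, stays unannotated
}

expanded = expand_ko_map(ko_to_uniref90, uniref90_to_uniref50)
print(to_mapping_lines(expanded))
```

In the toy case, `UniRef90_B` picks up K00368 because it shares a UniRef50 with the annotated `UniRef90_A`, while `UniRef90_C` is left alone.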

Thanks for providing that file; the results do seem to be in better agreement:

$ grep "K00368" uniref90_ko_NEW.cpm.unstratified.tsv | cut -f1-30
K00368	67.9994	110.217	42.1048	165.697	96.1491	85.9958	32.9454	67.5835	58.2956	0	0	0	16.5542	0	0	0	47.7426	36.4944	0	66.8156	36.0168	36.9052	49.0502	17.9201	62.3122	34.2248	99.2732	61.0391	0

I didn’t realize that the KO annotations were imported directly from UniRef rather than predicted de novo. How feasible would de novo prediction be? While the new mapping file is a huge improvement, I would still expect to see stronger agreement than this:

image

There’s a clear correlation, but there are a fair number of samples where either the KO or the EC annotation is missing, or where the point falls pretty far from the diagonal.
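As a rough way to tabulate those cases without the plot, here is a small Python sketch that sorts samples into "missing in one table" versus "off the diagonal". The `classify_samples` name and the 2-fold cutoff are arbitrary choices of mine, and the values are just the first 13 sample columns from the K00368 (new mapping) and EC 1.7.2.1 rows shown above, assuming both tables list samples in the same order.

```python
def classify_samples(ko_values, ec_values, max_fold=2.0):
    """Count samples by how well the KO and EC abundances agree."""
    if len(ko_values) != len(ec_values):
        raise ValueError("KO and EC rows must cover the same samples")
    counts = {"both_absent": 0, "ko_only": 0, "ec_only": 0,
              "concordant": 0, "discordant": 0}
    for ko, ec in zip(ko_values, ec_values):
        if ko == 0 and ec == 0:
            counts["both_absent"] += 1
        elif ec == 0:
            counts["ko_only"] += 1       # KO detected, EC missing
        elif ko == 0:
            counts["ec_only"] += 1       # EC detected, KO missing
        else:
            fold = max(ko, ec) / min(ko, ec)
            # max_fold is an arbitrary cutoff for "far from the diagonal".
            key = "concordant" if fold <= max_fold else "discordant"
            counts[key] += 1
    return counts


# First 13 sample columns of each row, copied from the outputs above.
ko_new = [67.9994, 110.217, 42.1048, 165.697, 96.1491, 85.9958, 32.9454,
          67.5835, 58.2956, 0, 0, 0, 16.5542]
ec = [68.7, 163.506, 38.3125, 160.131, 99.7002, 189.444, 72.9189,
      79.5386, 69.1712, 0, 143.998, 0, 94.3286]

summary = classify_samples(ko_new, ec)
print(summary)
```

For these 13 samples, that gives 7 within 2-fold, 3 more than 2-fold apart, 1 where only the EC is detected, and 2 where neither is.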

Glad that helped (even if it wasn’t a perfect solution). Running the HMM-based KO predictions on all of UniRef90 is a pretty big compute. That said, we’re working on something like that for HUMAnN 4 since we have a lot of new proteins in that release that needed annotations.