Hello,
I ran HUMAnN 3 on some metatranscriptomic samples using the UniRef90 database and then ran LEfSe on the table grouped by eggNOG IDs (and others). I got three hits that don’t seem to have a name in the eggNOG mapping file and that I can’t find on the eggNOG website.
One example:
ENOG4105GMY, which can be found in the UniRef90_eggnog file:
zgrep 'ENOG4105GMY' map_eggnog_uniref90.txt.gz
This returns:
ENOG4105GMY UniRef90_A4EB58 UniRef90_A5Z326 UniRef90_A6P140 UniRef90_A7B6K2 UniRef90_A7VEM4 UniRef90_A8SIB4 UniRef90_A8SSN6 UniRef90_B0MGJ9 UniRef90_B2PU91 UniRef90_B5CUP9 UniRef90_B6G1S7 UniRef90_C0QQH9 UniRef90_C1DUU5 UniRef90_C4XFZ6 UniRef90_C9N1P8 UniRef90_E0NVY9 UniRef90_H6R3P7 UniRef90_Q2NJ05
But not in the name mapping file:
zgrep 'ENOG4105GMY' map_eggnog_name.txt.gz
This command returns nothing.
The renaming command returns NO_NAME for these hits.
humann_rename_table -i no_unmapped_ungrouped/eggnog_cpm_genefamilies_joined_no_extra.tsv -n eggnog -o no_unmapped_ungrouped/renamed_eggnog_cpm_genefamilies_joined_no_extra.tsv
How do I find out what the names are? If there’s no way to directly find the name, is it possible to retrieve the sequence for that hit so I can BLAST it and try to determine its identity that way?
Thanks,
Samantha
We take the UniRef to eggNOG ID associations directly from UniProt, but we need to parse the eggNOG database to get the human-readable eggNOG names. It’s possible that the two are slightly out of sync, resulting in some UniProt-recognized eggNOGs that are no longer in the eggNOG database. One option would be to look at earlier versions of eggNOG to see if they are described there?
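If it helps to see how many eggNOG IDs are affected by this kind of mismatch, a rough sketch like the one below lists every ID that appears in the UniRef mapping file but not in the name file. The function name ids_without_name is made up for illustration (it is not a HUMAnN utility), and the sketch assumes both files are gzip-compressed with the ID in the first whitespace-delimited column. The demo uses tiny placeholder files rather than the real mappings.

```shell
# ids_without_name is a hypothetical helper, not part of HUMAnN.
# Assumption: both files are gzip-compressed and the ID is the first column.
# Usage: ids_without_name <members.gz> <names.gz>
ids_without_name() (
    names_tmp="$(mktemp)"
    # Collect the IDs that do have a name.
    gzip -dc "$2" | awk '{print $1}' > "$names_tmp"
    # Keep member-file IDs with no exact match in the name list.
    gzip -dc "$1" | awk '{print $1}' | grep -vxFf "$names_tmp" | sort -u
    rm -f "$names_tmp"
)

# Demo on placeholder data (PLACEHOLDER_OG and its name are invented).
demo_dir="$(mktemp -d)"
printf 'ENOG4105GMY\tUniRef90_A4EB58\nPLACEHOLDER_OG\tUniRef90_PLACEHOLDER\n' \
    | gzip -c > "$demo_dir/members.txt.gz"
printf 'PLACEHOLDER_OG\tplaceholder name\n' \
    | gzip -c > "$demo_dir/names.txt.gz"
ids_without_name "$demo_dir/members.txt.gz" "$demo_dir/names.txt.gz"
rm -rf "$demo_dir"
```

On the real files this would be called as ids_without_name map_eggnog_uniref90.txt.gz map_eggnog_name.txt.gz, run from the directory that holds the mapping files.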
To get corresponding sequences, you can always take a UniRef entry like UniRef90_A4EB58 (from the mapping file) and look it up on UniProt like this:
https://www.uniprot.org/uniprot/A4EB58.fasta
to get a protein sequence. If you replace .fasta with .txt you’ll be shown the representative protein’s annotations rather than its sequence.
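For a whole eggNOG group, the same lookup can be scripted. This is only a sketch: uniref_to_url is a made-up helper name, the URL pattern is the one quoted above, and it assumes that whatever follows the UniRef90_ prefix is a UniProt accession.

```shell
# uniref_to_url is a hypothetical helper name, not part of HUMAnN or UniProt.
# Assumption: the part after "UniRef90_" / "UniRef50_" is a UniProt accession.
uniref_to_url() (
    id="$1"
    ext="${2:-fasta}"        # "fasta" for the sequence, "txt" for annotations
    acc="${id#UniRef*_}"     # strip the "UniRef90_" style prefix
    printf 'https://www.uniprot.org/uniprot/%s.%s\n' "$acc" "$ext"
)

# First few fields of the mapping line quoted earlier in the thread.
line='ENOG4105GMY UniRef90_A4EB58 UniRef90_A5Z326 UniRef90_A6P140'
for field in $line; do
    case "$field" in
        UniRef*) uniref_to_url "$field" ;;   # skips the leading eggNOG ID
    esac
done

# To download one sequence (needs network access; -L follows redirects):
# curl -sL "$(uniref_to_url UniRef90_A4EB58)"
```

The loop prints one FASTA URL per UniRef90 member. Passing txt as the second argument gives the annotation URL instead.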
“UniRef to eggNOG ID associations directly from UniProt”
Where on UniProt do you get this info from? I can’t find it on the ftp server (Index of /pub/databases/uniprot/current_release/knowledgebase)
We parse all functional annotations from these two files (they also have XML equivalents if you prefer):
https://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/complete/uniprot_sprot.dat.gz
https://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/complete/uniprot_trembl.dat.gz
Which are concatenations of per-protein files that look like this:
https://www.uniprot.org/uniprot/P11440.txt
And specifically the DR lines that look like:
DR   eggNOG; KOG0594; Eukaryota.
Note that UniRef90/50 are subsets of the sequences detailed in the above files. So, for example, the eggNOG annotation for UniRef90_XYZ is based on the entry for XYZ itself in the above files.
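As a rough illustration of that parsing step (not the actual script we use), the sketch below pulls accession and eggNOG ID pairs out of flat-file text with awk. The function name extract_eggnog is invented for this example. It relies on each entry having an AC line with the primary accession first and ending with a // line, and the demo record is a trimmed stand-in rather than a full entry.

```shell
# extract_eggnog is a hypothetical helper name used only for this sketch.
# Reads UniProt flat-file text on stdin, prints "accession<TAB>eggNOG ID".
extract_eggnog() {
    awk '
        # Remember the first (primary) accession of the current entry.
        /^AC / && acc == "" { acc = $2; sub(/;$/, "", acc) }
        # Emit one line per eggNOG cross-reference.
        /^DR +eggNOG;/      { id = $3; sub(/;$/, "", id); print acc "\t" id }
        # "//" terminates an entry, so reset for the next one.
        /^\/\/$/            { acc = "" }
    '
}

# Trimmed stand-in record: only the AC and DR lines come from this thread.
extract_eggnog <<'EOF'
AC   P11440;
DR   eggNOG; KOG0594; Eukaryota.
//
EOF

# On the real files (large downloads), something like:
# gzip -dc uniprot_sprot.dat.gz | extract_eggnog > uniprot_eggnog_pairs.tsv
```

Joining that output against the UniRef90 representative accessions would then give UniRef-to-eggNOG pairs, in line with the note above about UniRef90_XYZ inheriting the annotation of XYZ.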
Awesome! Thanks for the detailed info!