Dear developers and users,
In HUMAnN3, among the utility_mapping accessory files, there is a file called map_level4ec_uniref90.txt.gz, which maps EC enzymes (4-number level) to UniRef90 protein IDs.
I was wondering how, and from what source, I can generate such a mapping table myself. For example, I'd like to use the latest UniRef90 database (which is updated roughly every 8 weeks) and map its entries to EC annotations.
I know this topic is not directly related to HUMAnN3, but I'd still appreciate any help!
We build these files by parsing the big “DAT” file that comes with each UniProt release. It is the file that looks like a concatenation of all these per-protein details:
Most EC annotations come from the lines starting with DE, but they can occasionally be found in comment lines (CC) and via the cross-references to the BRENDA database (
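As a rough illustration of the parsing described above (not the actual HUMAnN build script), here is a minimal Python sketch. It assumes the standard UniProt flat-file layout: a two-letter line code, records terminated by `//`, EC numbers written as `EC=...` in DE and CC lines, and BRENDA cross-references on `DR` lines. The function name `parse_ec_annotations` and the two sample records are made up for the example. It yields accession-to-EC pairs only; turning accessions into UniRef90 cluster IDs is a separate step not shown here.

```python
import re
from collections import defaultdict

# Complete four-level EC numbers only, e.g. 1.1.1.1 (partial ones like 1.1.1.- are skipped).
EC_IN_TEXT = re.compile(r"EC=(\d+\.\d+\.\d+\.\d+)")
BRENDA_XREF = re.compile(r"^DR\s+BRENDA;\s*(\d+\.\d+\.\d+\.\d+);")


def parse_ec_annotations(lines):
    """Return {primary accession: set of EC numbers} from UniProt flat-file lines."""
    mapping = defaultdict(set)
    accession = None
    ecs = set()
    for line in lines:
        code = line[:2]
        if code == "AC" and accession is None:
            # The first accession on the first AC line is the primary one.
            accession = line[5:].split(";")[0].strip()
        elif code in ("DE", "CC"):
            ecs.update(EC_IN_TEXT.findall(line))
        elif code == "DR":
            match = BRENDA_XREF.match(line)
            if match:
                ecs.add(match.group(1))
        elif code == "//":
            # End of record: keep it only if it carried an EC annotation.
            if accession and ecs:
                mapping[accession].update(ecs)
            accession, ecs = None, set()
    return dict(mapping)


# Invented records, for illustration only.
sample = """\
ID   ADH1_TEST               Reviewed;         375 AA.
AC   P00001; Q00002;
DE   RecName: Full=Alcohol dehydrogenase 1;
DE            EC=1.1.1.1 {ECO:0000269};
DR   BRENDA; 1.1.1.1; 2681.
//
ID   NOEC_TEST               Unreviewed;       100 AA.
AC   A0A000TEST1;
DE   SubName: Full=Uncharacterized protein;
//
""".splitlines()

ec_by_accession = parse_ec_annotations(sample)
```

For the real files, the same function can be fed a handle from `gzip.open(path, "rt")`, so the DAT file is streamed line by line rather than loaded into memory.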
Many thanks for the answer, @franzosa!
I can't access the parent directory of this link, though. It was just an example of a single record, yes?
Do you mean the huge file that comes with each UniProtKB release, uniprot_trembl.dat.gz?
There is also the much smaller uniprot_sprot.dat.gz, but it only covers Swiss-Prot, so I guess it would result in only a partial mapping?
Correct, that was just an example of the formatting. You will want to consider both the full SwissProt and TrEMBL files. Note that if HUMAnN reports a gene family like
XYZ will be an accession number in one of those files (unless the sequence has been retired).
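To check whether a given accession actually appears in either file, a simple scan of the `AC` lines is enough. The sketch below is only an illustration: `find_accession` is a hypothetical helper, and the file paths are whatever you downloaded. It assumes the usual flat-file layout, in which `AC` lines hold semicolon-separated accessions (primary and secondary).

```python
import gzip


def find_accession(accession, dat_paths):
    """Return the first path whose AC lines list the accession, else None."""
    for path in dat_paths:
        # Stream the gzipped flat file as text; nothing is loaded wholesale.
        with gzip.open(path, "rt", encoding="utf-8") as handle:
            for line in handle:
                if line.startswith("AC   "):
                    listed = [a.strip() for a in line[5:].split(";") if a.strip()]
                    if accession in listed:
                        return path
    return None
```

A call would look like `find_accession("XYZ", ["uniprot_sprot.dat.gz", "uniprot_trembl.dat.gz"])`. A linear scan of the TrEMBL file is slow, so for many lookups it is better to collect all accessions in a single pass.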