Problem creating a custom DB with KEGG

Hi,

I’ve recently subscribed to the KEGG database.
However, I am having trouble creating a humann2 custom database from it.

Specifically, I am stuck at the following step:

$ humann2_build_custom_database --input genes.pep --output custom_database --id-mapping legacy_kegg_idmapping.tsv --format diamond --taxonomic-profile max_taxonomic_profile.tsv

I am wondering what format the ‘genes.pep’ input file should have. Does it need identifiers, gene sequences, gene lengths, or anything else?
I would appreciate it if anyone could answer this question.
Thanks.

genes.pep should be a FASTA file whose sequence headers (i.e. the strings that appear after the > that begin sequence entries) appear in your legacy_kegg_idmapping.tsv file.
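For anyone who wants to check this before building the database, here is a minimal Python sketch that lists FASTA headers missing from the id-mapping file. It rests on two assumptions that are not confirmed in this thread: that the sequence ID is the first whitespace-delimited token after the >, and that the id-mapping file is tab-separated with the sequence ID in its first column. The function names are made up for illustration; adjust id_column if your mapping file is laid out differently.

```python
import csv


def fasta_ids(fasta_path):
    """Yield the ID of each FASTA entry (assumed: first token after '>')."""
    with open(fasta_path) as handle:
        for line in handle:
            if line.startswith(">"):
                fields = line[1:].split()
                if fields:
                    yield fields[0]


def mapping_ids(mapping_path, id_column=0):
    """Collect sequence IDs from a tab-separated id-mapping file.

    Assumption: the sequence ID sits in column `id_column` (0-based).
    Rows whose first field starts with '#' are skipped as comments.
    """
    ids = set()
    with open(mapping_path, newline="") as handle:
        for row in csv.reader(handle, delimiter="\t"):
            if len(row) > id_column and not row[0].startswith("#"):
                ids.add(row[id_column])
    return ids


def missing_from_mapping(fasta_path, mapping_path, id_column=0):
    """Return FASTA IDs (in file order) that do not appear in the mapping."""
    known = mapping_ids(mapping_path, id_column)
    return [seq_id for seq_id in fasta_ids(fasta_path) if seq_id not in known]
```

Calling missing_from_mapping("genes.pep", "legacy_kegg_idmapping.tsv") should return an empty list if every header is covered by the mapping.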

Thank you very much, Eric.

Kim

Hi @beeswaxag,

Could you give some details regarding how you generated the genes.pep as well as id_mapping from an updated KEGG database?

I also have access to an updated version and would like to use it.

Thanks.

Florentin

I am having the same issue. Could you please expand on your response? Do I need to generate the FASTA file myself? If so, I am still not clear on what the content of the genes.pep file should be.

Hi everyone,

I have access to a new version of KEGG and can’t seem to locate the input files for creating the id_mapping file in the command:
$ humann2_humann1_kegg --ikoc humann1/data/koc --igenels humann1/data/genels --o legacy_kegg_idmapping.tsv
Does anybody know which KEGG database files correspond to the ‘humann1/data/koc’ and ‘humann1/data/genels’ used here?

Thanks

Hi All - It’s unfortunately hard for us to answer these questions as we don’t have access to a current KEGG license, and the official KEGG installation has likely evolved since the time of HUMAnN 1.0 (which is when these files were first built).

If you’re doing custom alignment in HUMAnN (to KEGG or otherwise) the goal is to have 1) a file with all your sequences whose headers appear in 2) the id-mapping file (along with columns for functional category and taxonomy). Historically “1” was a file called genes.pep in the KEGG installation, and the id-mapping could be built from other KEGG-supplied files.

Thanks!
I fully understand what you wrote, but then what is ‘humann1/data/genels’ used for? Can you specify the headers for the two columns that appear in the file?

Thanks again,

Aya

That file appears to be a mapping from legacy KEGG sequence headers (as would appear in the genes.pep file) to their sequence length. You can inspect those files in the legacy HUMAnN repository (i.e. v1.0) here:

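If a current KEGG release no longer ships a ready-made header-to-length file, a table of that kind can be derived from the FASTA file itself. The Python sketch below is only an illustration under assumptions: the two-column, tab-separated layout (sequence ID, then length) is a guess based on the description above, not a confirmed genels format, and the function names are hypothetical. Compare the output against the legacy files before feeding it to any HUMAnN script.

```python
def sequence_lengths(fasta_path):
    """Map each FASTA ID (assumed: first token after '>') to its sequence length."""
    lengths = {}
    current = None
    with open(fasta_path) as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            if line.startswith(">"):
                fields = line[1:].split()
                current = fields[0] if fields else None
                if current is not None:
                    lengths[current] = 0
            elif current is not None:
                # Sequences may be wrapped over several lines, so accumulate.
                lengths[current] += len(line)
    return lengths


def write_length_table(lengths, out_path):
    """Write 'ID<TAB>length' rows (assumed layout, not a confirmed genels format)."""
    with open(out_path, "w") as out:
        for seq_id, length in lengths.items():
            out.write(f"{seq_id}\t{length}\n")
```

For example, write_length_table(sequence_lengths("genes.pep"), "genels.tsv") would write one row per sequence; the output file name here is arbitrary.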