I am wondering what format the ‘genes.pep’ input file should be in. Does it need identifiers, gene sequences, gene lengths, or anything else?
I would appreciate it if anyone could answer this question.
Thanks.
genes.pep should be a FASTA file whose sequence headers (i.e. the strings that appear after the > that begin sequence entries) appear in your legacy_kegg_idmapping.tsv file.
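To make "FASTA file" and "sequence headers" concrete, here is a minimal Python sketch that reads a FASTA file and reports each header with its sequence length. The example entries (`org1:gene0001` and so on) are made up for illustration and are not real KEGG identifiers, and treating the first whitespace-delimited token after the `>` as the identifier is an assumption; adjust it if your mapping file uses the full header line.

```python
import io
from typing import Iterator, TextIO, Tuple


def read_fasta(handle: TextIO) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from an open FASTA file."""
    header = None
    chunks = []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new entry starts: emit the previous one, if any.
            if header is not None:
                yield header, "".join(chunks)
            # Assumption: the identifier is the first token after ">".
            tokens = line[1:].split()
            header = tokens[0] if tokens else ""
            chunks = []
        elif header is not None:
            # Sequence lines may be wrapped over several lines.
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


# Made-up protein entries standing in for the contents of genes.pep.
example = io.StringIO(
    ">org1:gene0001 hypothetical protein\n"
    "MKTAYIAKQR\n"
    "QISFVK\n"
    ">org2:gene0002\n"
    "MSLNE\n"
)

records = list(read_fasta(example))
for header, sequence in records:
    print(header, len(sequence))
```

For a real file, replace the `io.StringIO` example with `open("genes.pep")`.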
I am having the same issue. Could you please expand on your response? Do I need to generate the FASTA file myself? If so, I am still not clear on what the content of the genes.pep file should be.
I have access to a new version of KEGG and can’t seem to locate the input files for creating the id_mapping file with the command: $ humann2_humann1_kegg --ikoc humann1/data/koc --igenels humann1/data/genels --o legacy_kegg_idmapping.tsv
Does anybody know which KEGG database files correspond to the ‘humann1/data/koc’ and ‘humann1/data/genels’ used here?
Hi All - It’s unfortunately hard for us to answer these questions as we don’t have access to a current KEGG license, and the official KEGG installation has likely evolved since the time of HUMAnN 1.0 (which is when these files were first built).
If you’re doing custom alignment in HUMAnN (to KEGG or otherwise), the goal is to have 1) a file containing all your sequences, and 2) an id-mapping file in which every header from that sequence file appears (along with columns for functional category and taxonomy). Historically “1” was a file called genes.pep in the KEGG installation, and the id-mapping could be built from other KEGG-supplied files.
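One way to check that requirement before running an alignment is to compare the FASTA headers against the id-mapping file. The sketch below is only an illustration: it assumes the id-mapping file is tab-separated with the sequence identifier in the first column, and the identifiers, categories, and taxa in the example data are invented.

```python
import csv
import io
from typing import List, Set, TextIO


def fasta_headers(handle: TextIO) -> List[str]:
    """Return the first token of every '>' header line, in file order."""
    headers = []
    for line in handle:
        if line.startswith(">"):
            tokens = line[1:].split()
            if tokens:
                headers.append(tokens[0])
    return headers


def mapped_ids(handle: TextIO, column: int = 0) -> Set[str]:
    """Collect identifiers from one column of a tab-separated mapping file."""
    ids = set()
    for row in csv.reader(handle, delimiter="\t"):
        if len(row) > column and row[column]:
            ids.add(row[column])
    return ids


def missing_headers(fasta: TextIO, mapping: TextIO) -> List[str]:
    """List FASTA headers that have no entry in the mapping file."""
    known = mapped_ids(mapping)
    return [h for h in fasta_headers(fasta) if h not in known]


# Invented example data; the column layout is an assumption.
fasta_example = io.StringIO(
    ">org1:gene0001\nMKTAYIAKQR\n"
    ">org2:gene0002\nMSLNE\n"
    ">org3:gene0003\nMVVLA\n"
)
mapping_example = io.StringIO(
    "org1:gene0001\tK00001\ttaxon_A\n"
    "org2:gene0002\tK00002\ttaxon_B\n"
)

missing = missing_headers(fasta_example, mapping_example)
print("Headers missing from the id-mapping file:", missing)
```

An empty list means every sequence header has a mapping entry; anything listed would need to be added to the mapping file or dropped from the FASTA file.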
Thanks!
I fully understand what you wrote, but then what is ‘humann1/data/genels’ used for? Can you specify the headers for the two columns that appear in the file?
That file appears to be a mapping from legacy KEGG sequence headers (as would appear in the genes.pep file) to their sequence length. You can inspect those files in the legacy HUMAnN repository (i.e. v1.0) here: