I wanted to run the humann_regroup_table command to compute KOs from the genefamilies, but the argument for --group uniref90_ko isn’t available.
I’ve successfully download the humann_database: humann_databases --download utility_mapping full $DIR
However, I got an error that state ‘Unable to write on the Humann config file’. I presume that this may be due to the permission to read/write in selected folders?
How can I provide the file directly from the humann_database that I’ve downloaded (or if there is any mapping/reformatting file available) to perform the regroup as a workaround to the above to compute KOs?
Most of the scripts have a -c flag that you can use to pass a custom mapping file, but it will still work if you use one of the files from the utility mapping database. You can also try running the humann_config script (maybe as an admin?) to point it to the utility mapping folder that you downloaded and unpacked.
Thank you very much @franzosa for your help! I tried with the -c flag on the utility mapping database that I downloaded as a local copy and it worked very well.
Of the features grouped 1+ times, I got 13.1% for uniref90_rxn but 4.5% for uniref90_ko. For features grouped 2+ times, I got 3.7% and 0% respectively.
Upon reading your paper on Humann (Nature Methods, 2018), it seems that metacyc reactions is the most highly resolved method for uniref90 gene families. Therefore would this be the reason that the grouping % is higher with Metacyc reactions compared to KOs, and are these levels reasonable or expected?
Related to this, I have some UniRef90 IDs from HUMANn3 that don’t appear in utility_mapping/map_ko_uniref90.txt.gz
Is that expected because of low coverage of UniRef90 IDs for the KEGG mapping?
I also spot-checked one ID, UniRef90_X5NU12 on the UniProt website and it did not show up there either. Is that because of the difference in database versions?
That ID does show up in the GO mapping (utility_mapping/map_go_uniref90.txt.gz) and the EC mapping so I may end up using one of those instead.
A UniRef will only show up in the KO mapping file if its representative protein was annotated to a KO in UniProt. This is only true for ~10% of proteins. ECs tend to be similar, while Pfams and GO have higher coverage.
Because UniRef is re-clustered every month the representative sequences sometimes change. However, unless a sequence is retired, its protein entry will remain. X5NU12 may no longer be a cluster representative but you can still look up the protein in UniProt: