Humann_regroup_table for uniref90 KO

Hi,
I’m sorry if there was a previous post with this problem, but I couldn’t find any that could solve the issue.

I performed humann on my fasta file in Humann3, which generated the genefamilies.tsv, pathabundance.tsv and pathcoverage.tsv files as expected.

humann -i input_file.fasta -o humann_output_folder --search-mode uniref90 --nucleotide-database chocophlan_v201901b -protein-database uniref90_diamond

I wanted to run the humann_regroup_table command to compute KOs from the genefamilies, but the argument for --group uniref90_ko isn’t available.

  • I’ve successfully download the humann_database: humann_databases --download utility_mapping full $DIR
    However, I got an error that state ‘Unable to write on the Humann config file’. I presume that this may be due to the permission to read/write in selected folders?
  • How can I provide the file directly from the humann_database that I’ve downloaded (or if there is any mapping/reformatting file available) to perform the regroup as a workaround to the above to compute KOs?

Any advice is much appreciated. Thank you!

1 Like

Most of the scripts have a -c flag that you can use to pass a custom mapping file, but it will still work if you use one of the files from the utility mapping database. You can also try running the humann_config script (maybe as an admin?) to point it to the utility mapping folder that you downloaded and unpacked.

1 Like

Thank you very much @franzosa for your help! I tried with the -c flag on the utility mapping database that I downloaded as a local copy and it worked very well.

Of the features grouped 1+ times, I got 13.1% for uniref90_rxn but 4.5% for uniref90_ko. For features grouped 2+ times, I got 3.7% and 0% respectively.

Upon reading your paper on Humann (Nature Methods, 2018), it seems that metacyc reactions is the most highly resolved method for uniref90 gene families. Therefore would this be the reason that the grouping % is higher with Metacyc reactions compared to KOs, and are these levels reasonable or expected?

Thanks again!

1 Like

RXNS, ECs, and KOs are usually in the ~10% range. Hence the KO number there strikes me as low, but not in a seriously alarming way.

1 Like

Related to this, I have some UniRef90 IDs from HUMANn3 that don’t appear in utility_mapping/map_ko_uniref90.txt.gz

Is that expected because of low coverage of UniRef90 IDs for the KEGG mapping?

I also spot-checked one ID, UniRef90_X5NU12 on the UniProt website and it did not show up there either. Is that because of the difference in database versions?

That ID does show up in the GO mapping (utility_mapping/map_go_uniref90.txt.gz) and the EC mapping so I may end up using one of those instead.

Thanks

1 Like

A UniRef will only show up in the KO mapping file if its representative protein was annotated to a KO in UniProt. This is only true for ~10% of proteins. ECs tend to be similar, while Pfams and GO have higher coverage.

Because UniRef is re-clustered every month the representative sequences sometimes change. However, unless a sequence is retired, its protein entry will remain. X5NU12 may no longer be a cluster representative but you can still look up the protein in UniProt:

https://www.uniprot.org/uniprotkb/X5NU12/entry

1 Like