Humann uses outdated uniref50 mapping file for mapping Uniref50 ids to names by default

I was running humann with the Uniref50 database.
I noticed some weird behavior of Humann. It looks like that there is a bug, but it may also be me that made something wrong.

I noticed when running Humann that my genefamilies files contains lines which contain sometimes already names for the uniref50 identifiers, but sometimes not.

The original humann output contains the uniref50 identifier followed by a colon β€œ:”, space and the identifier name, followed by tab and a number.
But in some lines, there is not the colon and the identifier name.

I was expecting to find some name in the result genefamilies file. It was not there.
But, when i searched for the uniref identifier, I could find the entry in the file.

So it seems humann does add for some entries automatically an identifier name, but for some not.

I tried to add names to the genefamilies output using humann_rename_table.
I was expecting that the humann/tools/ can cope with the input file, but it does not.
The rename script loads the full utility mapping database for uniref50 (utility_mapping/map_uniref50_name.txt.bz2).
I extracted that file and found the text file map_uniref50_name.txt that maps uniref identifiers to a name (e.g. UniRef50_A0A1B1LSB2 Shufflon protein D).
The problem is however that for some lines there is way to join the unifier identifier with the identifier in the mapping file.
humann_rename_table -i Sample-A1B_minlen70_genefamilies_fixed.tsv -o Sample-A1B_minlen70_genefamilies_fixed_named.tsv -n uniref50

I added some print statements to to find out that the following line in the script may cause the problem.
allowed_keys = {k.split( util.c_strat_delim )[0]:1 for k in table.rowheads}
because it assumes that there are no β€œ:” in the gene families output tsv.
The polymap is tried to be constructed using a uniref-identifer : uniref-name, as a result the new name cannot be inferred from the old name.

Example output from humann in genefamilies.tsv:

# Gene Family	Sample-A1B_minlen70_Abundance-RPKs
UNMAPPED	666495.0000000000
UniRef50_G0JZW7: Plasmid maintenance system killer	97.4691404486
UniRef50_G0JZW7: Plasmid maintenance system killer|g__Nitrosomonas.s__Nitrosomonas_europaea	97.4691404486
UniRef50_Q04508	97.1632663387
UniRef50_Q04508|g__Nitrosomonas.s__Nitrosomonas_europaea	59.5841332350
UniRef50_Q04508|g__Nitrosomonas.s__Nitrosomonas_eutropha	27.6030402455
UniRef50_Q04508|unclassified	9.9760928582

Not there are some uniref entries that have a β€œ:” followed by a identifier.
Strangely, both uniref50 identifiers can be found in the utility mapping database, though.

The result is this from humann_rename_table:

# Gene Family	Sample-A1B_minlen70_Abundance-RPKs
UNMAPPED	666495.0000000000
UniRef50_G0JZW7: NO_NAME	97.4691404486
UniRef50_G0JZW7: NO_NAME|g__Nitrosomonas.s__Nitrosomonas_europaea	97.4691404486
UniRef50_Q04508: Ammonia monooxygenase beta subunit	97.1632663387
UniRef50_Q04508: Ammonia monooxygenase beta subunit|g__Nitrosomonas.s__Nitrosomonas_europaea	59.5841332350
UniRef50_Q04508: Ammonia monooxygenase beta subunit|g__Nitrosomonas.s__Nitrosomonas_eutropha	27.6030402455
UniRef50_Q04508: Ammonia monooxygenase beta subunit|unclassified	9.9760928582

You can see existing names have been replaced, because the polymap construction could not utilize the combination of uniref-id and name.
the variable in the rename_table script was:
allowed_keys {'UNMAPPED': 1, 'UniRef50_G0JZW7: Plasmid maintenance system killer': 1, 'UniRef50_Q04508': 1}
which gets passed to the polymap construction function:
polymap = util.load_polymap( c_default_names[args.names].path, allowed_keys=allowed_keys )

I found the reason, why in my case there are some uniref50 identifiers that do not have a name in the humann genefamilies output.
It seems that humann used the pip package uniref name mapping file in the data/misc folder:
I have downloaded the pypi humann 3.6 package from:

When extracting the map_uniref50_name.txt.bz2 to map_uniref50_name.txt, I found that this mapping file simply does not contain the same values as the mapping file in the full utility_mapping database.
The txt mapping file on pip has around 286,2 MB. Whereas the map_uniref50_name.txt has in the full utility mapping database around 581 MB.

I can understand, that somehow the incorrect uniref50 mapping file was chosen when generating my humann gene families output.
However, at the time of running humann, my humann_config file contained the path to the full utility_mapping database (outside of the conda environment, on a separate location) and the folder was not empty.
This is strange.

I would expect that humann uses the full utility mapping database when generating the genefamilies file.
Running humann took already very long and the result is disappointing now, because the wrong names have been added.

I checked the source code of humann further and found that calls function gene_families.

This function then uses some config.gene_family_name_mapping_file:

And it turns out that the config.gene_family_name_mapping_file file is always picked from humann’s install directory in data/misc.

It seems that humann tries to resolve the uniref identifiers with a name, but it always uses the mapping file in the misc/data folder. As a described the pip humann package version 3.6 uses an outdated version of uniref50 id to name mappings, when compared to the full utility-mapping db.
But this issue seems only relevant for uniref50, because uniref90 id’s are never found in the uniref50 mapping file. Thus, the uniref90 output always only contains uniref-ids instead of names.

Best regards.

1 Like

As a temporary workaround for this issue, I propose to remove the outdated invalid Uniref50 names that were added for some rows, by using the following sed command on the original genefamilies.tsv output of humann.

sed 's/: .*|/|/;s/: .*\t/\t/' "${originalHumannGeneFamiliesTsvFile}" > "${humannGeneFamiliesTsvFileWithoutUnirefNames}"

It removes the colon ( : ) followed by a space and the Uniref50 name until the pipe symbol ( | ) and replaces it with a pipe. Afterwards, it replaces the colon followed by a space and the Uniref50 name until the TAB symbol with a TAB in the originalHumannGeneFamiliesTsvFile and saves the output in a new file humannGeneFamiliesTsvFileWithoutUnirefNames.

After removing the outdated Uniref50 names from the genefamilies file, one can run humann_rename_table to obtain genefamilies.tsv with Uniref50 names from the mapping file in the full utility_mapping database.

humann_rename_table --names uniref50 -i "${humannGeneFamiliesTsvFileWithoutUnirefNames}"` -o "${outputGeneFamiliesTsvWithNames}"

The script is executed when running humann_rename_table and one can see that it uses by default the utility_mapping database in the humann config (humann_config --print), which in my case contains the path to the full utility_mapping db.

Thanks for pointing this out - we’ll look into it. In the meantime you should be able to manually point the renaming script at the proper file.