Hi,
I was running humann with the Uniref50 database.
I noticed some weird behavior of Humann. It looks like that there is a bug, but it may also be me that made something wrong.
I noticed when running Humann that my genefamilies files contains lines which contain sometimes already names for the uniref50 identifiers, but sometimes not.
The original humann output contains the uniref50 identifier followed by a colon “:”, space and the identifier name, followed by tab and a number.
But in some lines, there is not the colon and the identifier name.
I was expecting to find some name in the result genefamilies file. It was not there.
But, when i searched for the uniref identifier, I could find the entry in the file.
So it seems humann does add for some entries automatically an identifier name, but for some not.
I tried to add names to the genefamilies output using humann_rename_table
.
I was expecting that the humann/tools/rename_table.py
can cope with the input file, but it does not.
The rename script loads the full utility mapping database for uniref50 (utility_mapping/map_uniref50_name.txt.bz2
).
I extracted that file and found the text file map_uniref50_name.txt
that maps uniref identifiers to a name (e.g. UniRef50_A0A1B1LSB2 Shufflon protein D
).
The problem is however that for some lines there is way to join the unifier identifier with the identifier in the mapping file.
humann_rename_table -i Sample-A1B_minlen70_genefamilies_fixed.tsv -o Sample-A1B_minlen70_genefamilies_fixed_named.tsv -n uniref50
I added some print statements to rename_table.py
to find out that the following line in the script may cause the problem.
allowed_keys = {k.split( util.c_strat_delim )[0]:1 for k in table.rowheads}
because it assumes that there are no “:” in the gene families output tsv.
The polymap is tried to be constructed using a uniref-identifer : uniref-name, as a result the new name cannot be inferred from the old name.
Example output from humann in genefamilies.tsv:
# Gene Family Sample-A1B_minlen70_Abundance-RPKs
UNMAPPED 666495.0000000000
UniRef50_G0JZW7: Plasmid maintenance system killer 97.4691404486
UniRef50_G0JZW7: Plasmid maintenance system killer|g__Nitrosomonas.s__Nitrosomonas_europaea 97.4691404486
UniRef50_Q04508 97.1632663387
UniRef50_Q04508|g__Nitrosomonas.s__Nitrosomonas_europaea 59.5841332350
UniRef50_Q04508|g__Nitrosomonas.s__Nitrosomonas_eutropha 27.6030402455
UniRef50_Q04508|unclassified 9.9760928582
Not there are some uniref entries that have a “:” followed by a identifier.
Strangely, both uniref50 identifiers can be found in the utility mapping database, though.
The result is this from humann_rename_table
:
# Gene Family Sample-A1B_minlen70_Abundance-RPKs
UNMAPPED 666495.0000000000
UniRef50_G0JZW7: NO_NAME 97.4691404486
UniRef50_G0JZW7: NO_NAME|g__Nitrosomonas.s__Nitrosomonas_europaea 97.4691404486
UniRef50_Q04508: Ammonia monooxygenase beta subunit 97.1632663387
UniRef50_Q04508: Ammonia monooxygenase beta subunit|g__Nitrosomonas.s__Nitrosomonas_europaea 59.5841332350
UniRef50_Q04508: Ammonia monooxygenase beta subunit|g__Nitrosomonas.s__Nitrosomonas_eutropha 27.6030402455
UniRef50_Q04508: Ammonia monooxygenase beta subunit|unclassified 9.9760928582
You can see existing names have been replaced, because the polymap construction could not utilize the combination of uniref-id and name.
the variable in the rename_table script was:
allowed_keys {'UNMAPPED': 1, 'UniRef50_G0JZW7: Plasmid maintenance system killer': 1, 'UniRef50_Q04508': 1}
which gets passed to the polymap construction function:
polymap = util.load_polymap( c_default_names[args.names].path, allowed_keys=allowed_keys )
I found the reason, why in my case there are some uniref50 identifiers that do not have a name in the humann genefamilies output.
It seems that humann used the pip package uniref name mapping file in the data/misc folder:
~/miniconda3/envs/humann3.6_metaphlan4_py3.9/lib/python3.9/site-packages/humann/data/misc/map_uniref50_name.txt.bz2
.
I have downloaded the pypi humann 3.6 package from:
When extracting the map_uniref50_name.txt.bz2 to map_uniref50_name.txt, I found that this mapping file simply does not contain the same values as the mapping file in the full utility_mapping database.
The txt mapping file on pip has around 286,2 MB. Whereas the map_uniref50_name.txt has in the full utility mapping database around 581 MB.
I can understand, that somehow the incorrect uniref50 mapping file was chosen when generating my humann gene families output.
However, at the time of running humann, my humann_config
file contained the path to the full utility_mapping database (outside of the conda environment, on a separate location) and the folder was not empty.
This is strange.
I would expect that humann uses the full utility mapping database when generating the genefamilies file.
Running humann took already very long and the result is disappointing now, because the wrong names have been added.
I checked the source code of humann further and found that humann.py calls families.py function gene_families
.
This function then uses some config.gene_family_name_mapping_file
:
gene_names=store.Names(config.gene_family_name_mapping_file)
And it turns out that the config.gene_family_name_mapping_file
file is always picked from humann’s install directory in data/misc.
gene_family_name_mapping_file=os.path.abspath(os.path.join(humann_install_directory,"data","misc","map_uniref50_name.txt.bz2"))
It seems that humann tries to resolve the uniref identifiers with a name, but it always uses the mapping file in the misc/data folder. As a described the pip humann package version 3.6 uses an outdated version of uniref50 id to name mappings, when compared to the full utility-mapping db.
But this issue seems only relevant for uniref50, because uniref90 id’s are never found in the uniref50 mapping file. Thus, the uniref90 output always only contains uniref-ids instead of names.
Best regards.