Humann uses outdated uniref50 mapping file for mapping Uniref50 ids to names by default

Bernhard · January 21, 2023, 8:25pm

Hi,
I was running humann with the Uniref50 database.
I noticed some odd behavior in humann. It looks like a bug, but I may also have done something wrong.

When running humann, I noticed that some lines in my genefamilies file already contain names for the uniref50 identifiers, while others do not.

The original humann output contains the uniref50 identifier followed by a colon “:”, a space, and the identifier name, followed by a tab and a number.
But in some lines, the colon and the identifier name are missing.

I was expecting to find a particular name in the resulting genefamilies file, but it was not there.
However, when I searched for the uniref identifier, I could find the entry in the file.

So it seems humann automatically adds an identifier name for some entries, but not for others.

I tried to add names to the genefamilies output using humann_rename_table.
I was expecting that humann/tools/rename_table.py could cope with the input file, but it cannot.
The rename script loads the full utility mapping database for uniref50 (utility_mapping/map_uniref50_name.txt.bz2).
I extracted that file and found the text file map_uniref50_name.txt that maps uniref identifiers to a name (e.g. UniRef50_A0A1B1LSB2 Shufflon protein D).
The problem, however, is that for some lines there is no way to join the uniref identifier in the genefamilies file with the identifier in the mapping file.
humann_rename_table -i Sample-A1B_minlen70_genefamilies_fixed.tsv -o Sample-A1B_minlen70_genefamilies_fixed_named.tsv -n uniref50

I added some print statements to rename_table.py and found that the following line in the script may cause the problem:
allowed_keys = {k.split( util.c_strat_delim )[0]:1 for k in table.rowheads}
because it assumes that there is no “:” in the gene families output tsv.
For rows that already carry a name, the key becomes "uniref-identifier: uniref-name". That key does not match any identifier in the mapping file, so the polymap cannot supply a new name for the row.
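To make the key problem concrete, here is a minimal sketch. It assumes the stratification delimiter is "|" and the name delimiter is ": " (as suggested by the output and the config excerpt in this thread); `keys_without_names` is a hypothetical variant, not code from humann.

```python
# Assumed delimiters: "|" separates the feature from its stratification,
# ": " joins an identifier and its name (both inferred from this thread).
STRAT_DELIM = "|"
NAME_DELIM = ": "

rowheads = [
    "UNMAPPED",
    "UniRef50_G0JZW7: Plasmid maintenance system killer",
    "UniRef50_G0JZW7: Plasmid maintenance system killer|g__Nitrosomonas.s__Nitrosomonas_europaea",
    "UniRef50_Q04508",
    "UniRef50_Q04508|g__Nitrosomonas.s__Nitrosomonas_europaea",
]

def keys_as_in_script(heads):
    """Reproduce the quoted line: only the stratification suffix is removed."""
    return {h.split(STRAT_DELIM)[0]: 1 for h in heads}

def keys_without_names(heads):
    """Hypothetical variant: also drop an already attached ': name' suffix."""
    return {h.split(STRAT_DELIM)[0].split(NAME_DELIM)[0]: 1 for h in heads}

print(keys_as_in_script(rowheads))   # keeps 'UniRef50_G0JZW7: Plasmid ...' as a key
print(keys_without_names(rowheads))  # keeps plain 'UniRef50_G0JZW7'
```

With the second variant, the key for the already named row would match the identifier column of the mapping file again.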

Example output from humann in genefamilies.tsv:

# Gene Family	Sample-A1B_minlen70_Abundance-RPKs
UNMAPPED	666495.0000000000
UniRef50_G0JZW7: Plasmid maintenance system killer	97.4691404486
UniRef50_G0JZW7: Plasmid maintenance system killer|g__Nitrosomonas.s__Nitrosomonas_europaea	97.4691404486
UniRef50_Q04508	97.1632663387
UniRef50_Q04508|g__Nitrosomonas.s__Nitrosomonas_europaea	59.5841332350
UniRef50_Q04508|g__Nitrosomonas.s__Nitrosomonas_eutropha	27.6030402455
UniRef50_Q04508|unclassified	9.9760928582

Note that some uniref entries have a “:” followed by a name, while others do not.
Strangely, both uniref50 identifiers can be found in the utility mapping database, though.

The result is this from humann_rename_table:

# Gene Family	Sample-A1B_minlen70_Abundance-RPKs
UNMAPPED	666495.0000000000
UniRef50_G0JZW7: NO_NAME	97.4691404486
UniRef50_G0JZW7: NO_NAME|g__Nitrosomonas.s__Nitrosomonas_europaea	97.4691404486
UniRef50_Q04508: Ammonia monooxygenase beta subunit	97.1632663387
UniRef50_Q04508: Ammonia monooxygenase beta subunit|g__Nitrosomonas.s__Nitrosomonas_europaea	59.5841332350
UniRef50_Q04508: Ammonia monooxygenase beta subunit|g__Nitrosomonas.s__Nitrosomonas_eutropha	27.6030402455
UniRef50_Q04508: Ammonia monooxygenase beta subunit|unclassified	9.9760928582

You can see that the existing names have been replaced by NO_NAME, because the polymap construction could not use the combination of uniref-id and name.
The variable in the rename_table script was:
allowed_keys {'UNMAPPED': 1, 'UniRef50_G0JZW7: Plasmid maintenance system killer': 1, 'UniRef50_Q04508': 1}
which gets passed to the polymap construction function:
polymap = util.load_polymap( c_default_names[args.names].path, allowed_keys=allowed_keys )

I found the reason why, in my case, some uniref50 identifiers do not have a name in the humann genefamilies output.
It seems that humann used the uniref name mapping file shipped with the pip package in the data/misc folder:
~/miniconda3/envs/humann3.6_metaphlan4_py3.9/lib/python3.9/site-packages/humann/data/misc/map_uniref50_name.txt.bz2.
I have downloaded the pypi humann 3.6 package from:

When extracting map_uniref50_name.txt.bz2 to map_uniref50_name.txt, I found that this mapping file simply does not contain the same values as the mapping file in the full utility_mapping database.
The extracted txt mapping file from pip is around 286.2 MB, whereas map_uniref50_name.txt in the full utility mapping database is around 581 MB.
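Beyond comparing file sizes, the two mapping files can be compared entry by entry. Below is a rough sketch that assumes both files are bz2-compressed and tab-delimited (ID, then name); the function names are my own, and the paths have to be supplied by you. Note that loading the full file into a dict needs a fair amount of memory.

```python
import bz2

def load_name_map(path):
    """Read a bz2-compressed, tab-delimited 'ID<TAB>name' file into a dict."""
    names = {}
    with bz2.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            fields = line.rstrip("\n").split("\t")
            if len(fields) >= 2:
                names[fields[0]] = fields[1]
    return names

def compare_name_maps(packaged_path, full_path):
    """Count IDs unique to each file and shared IDs whose names differ."""
    packaged = load_name_map(packaged_path)
    full = load_name_map(full_path)
    shared = packaged.keys() & full.keys()
    return {
        "only_in_packaged": len(packaged.keys() - full.keys()),
        "only_in_full": len(full.keys() - packaged.keys()),
        "shared_ids_with_different_name": sum(
            1 for key in shared if packaged[key] != full[key]
        ),
    }
```

Calling `compare_name_maps` with the packaged data/misc file and the utility_mapping file would show how many identifiers are missing from the smaller file.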

So it appears that the wrong uniref50 mapping file was somehow chosen when generating my humann gene families output.
However, at the time of running humann, my humann_config file contained the path to the full utility_mapping database (outside of the conda environment, on a separate location) and the folder was not empty.
This is strange.

I would expect that humann uses the full utility mapping database when generating the genefamilies file.
Running humann already took a very long time, and the result is now disappointing because the wrong names have been added.

I checked the source code of humann further and found that humann.py calls the gene_families function in families.py.

github.com/biobakery/humann/blob/3.6/humann/humann.py#L1110

    unaligned_reads_store.clear()

    # Compute or load in gene families
    output_files=[]
    if args.input_format in ["fasta","fastq","sam","blastm8"]:
        # Compute the gene families
        message="Computing gene families ..."
        logger.info(message)
        print("\n"+message)

        families_file=families.gene_families(alignments,gene_scores,unaligned_reads_count)
        output_files.append(families_file)

        start_time=timestamp_message("computing gene families",start_time)

    elif args.input_format in ["genetable"]:
        # Load the gene scores
        message="Process the gene table ..."
        logger.info(message)
        print("\n"+message)

This function then uses some config.gene_family_name_mapping_file:
gene_names=store.Names(config.gene_family_name_mapping_file)

github.com/biobakery/humann/blob/3.6/humann/quantify/families.py#L48

    """
    Compute the gene families from the alignments
    """

    logger.debug("Compute gene families")

    # Compute scores for each gene family for each bug set
    alignments.convert_alignments_to_gene_scores(gene_scores)

    # Process the gene id to names mappings
    gene_names=store.Names(config.gene_family_name_mapping_file)

    delimiter=config.output_file_column_delimiter
    category_delimiter=config.output_file_category_delimiter

    # Write the scores ordered with the top first
    column_name=config.file_basename+"_Abundance-RPKs"
    if config.remove_column_description_output:
        column_name=config.file_basename
    tsv_output=["# Gene Family"+delimiter+column_name]

And it turns out that config.gene_family_name_mapping_file is always picked from humann’s install directory in data/misc.
gene_family_name_mapping_file=os.path.abspath(os.path.join(humann_install_directory,"data","misc","map_uniref50_name.txt.bz2"))

github.com/biobakery/humann/blob/3.6/humann/config.py#L282

    # pathways files
    humann_install_directory=os.path.dirname(os.path.abspath(__file__))
    metacyc_gene_to_reactions=os.path.abspath(os.path.join(humann_install_directory,"data","pathways","metacyc_reactions_level4ec_only.uniref.bz2"))
    metacyc_reactions_to_pathways=os.path.abspath(os.path.join(humann_install_directory,"data","pathways","metacyc_pathways_structured_filtered_v24"))

    unipathway_database_part1=os.path.abspath(os.path.join(humann_install_directory,"data","pathways","unipathway_uniprots.uniref.bz2"))
    unipathway_database_part2=os.path.abspath(os.path.join(humann_install_directory,"data","pathways","unipathway_pathways"))

    # pathways and gene families name mapping files
    gene_family_name_mapping_file=os.path.abspath(os.path.join(humann_install_directory,"data","misc","map_uniref50_name.txt.bz2"))
    pathway_name_mapping_file=os.path.abspath(os.path.join(humann_install_directory,"data","misc","map_metacyc-pwy_name.txt.gz"))
    name_mapping_file_delimiter="\t"
    name_mapping_join=": "

    # pathways database selection
    pathways_database_choices=["metacyc","unipathway"]
    pathways_database=pathways_database_choices[0]

    # selected pathways
    pathways_database_part1=metacyc_gene_to_reactions

It seems that humann tries to resolve the uniref identifiers to names, but it always uses the mapping file in the data/misc folder. As described above, the pip humann package version 3.6 ships an outdated version of the uniref50 id-to-name mapping compared to the full utility_mapping db.
This issue seems relevant only for uniref50, because uniref90 IDs are never found in the uniref50 mapping file. Thus, the uniref90 output always contains only uniref IDs instead of names.
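What I would have expected is roughly the following lookup order. This is only a sketch of the idea: `choose_name_mapping_file` is a hypothetical helper and not part of humann, and both arguments are assumed to come from the user's configuration.

```python
import os

def choose_name_mapping_file(utility_mapping_dir, packaged_file,
                             filename="map_uniref50_name.txt.bz2"):
    """Hypothetical helper, not part of humann.

    Prefer the mapping file from the configured utility_mapping directory
    and fall back to the packaged data/misc file if it is not present.
    """
    if utility_mapping_dir:
        candidate = os.path.join(utility_mapping_dir, filename)
        if os.path.isfile(candidate):
            return candidate
    return packaged_file
```

This mirrors the check rename_table.py already does for the larger mapping files, so names would be consistent between the genefamilies output and humann_rename_table.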

Best regards.

Bernhard · January 22, 2023, 2:13pm

As a temporary workaround for this issue, I propose to remove the outdated invalid Uniref50 names that were added for some rows, by using the following sed command on the original genefamilies.tsv output of humann.

sed 's/: .*|/|/;s/: .*\t/\t/' "${originalHumannGeneFamiliesTsvFile}" > "${humannGeneFamiliesTsvFileWithoutUnirefNames}"

It removes the colon ( : ) followed by a space and the Uniref50 name until the pipe symbol ( | ) and replaces it with a pipe. Afterwards, it replaces the colon followed by a space and the Uniref50 name until the TAB symbol with a TAB in the originalHumannGeneFamiliesTsvFile and saves the output in a new file humannGeneFamiliesTsvFileWithoutUnirefNames.
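For those who prefer not to rely on sed's handling of \t, the same cleanup can be sketched in Python. It only touches the first column and assumes that names contain neither "|" nor a tab; the function names are my own.

```python
def strip_attached_name(line, name_delim=": ", strat_delim="|", column_delim="\t"):
    """Remove an attached ': name' from the feature column of one TSV line."""
    feature, sep, rest = line.partition(column_delim)
    ident, bar, strat = feature.partition(strat_delim)
    ident = ident.split(name_delim, 1)[0]  # keep only the identifier part
    return ident + bar + strat + sep + rest

def strip_names_from_file(in_path, out_path):
    """Write a copy of a genefamilies TSV with all attached names removed."""
    with open(in_path, encoding="utf-8") as src, \
         open(out_path, "w", encoding="utf-8") as dst:
        for line in src:
            dst.write(strip_attached_name(line))
```

Lines without an attached name, such as the header, UNMAPPED, and unnamed UniRef50 rows, pass through unchanged.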

After removing the outdated Uniref50 names from the genefamilies file, one can run humann_rename_table to obtain genefamilies.tsv with Uniref50 names from the mapping file in the full utility_mapping database.

humann_rename_table --names uniref50 -i "${humannGeneFamiliesTsvFileWithoutUnirefNames}" -o "${outputGeneFamiliesTsvWithNames}"

The script rename_table.py is executed when running humann_rename_table and one can see that it uses by default the utility_mapping database in the humann config (humann_config --print), which in my case contains the path to the full utility_mapping db.

github.com/biobakery/humann/blob/3.6/humann/tools/rename_table.py#L60

    # get a list of all available script mapping files
    try:
        all_mapping_files=os.listdir(config.utility_mapping_database)
    except EnvironmentError:
        all_mapping_files=[]

    # add the options for the larger mapping files if they are present
    larger_mapping_files_found=False
    if "map_uniref50_name.txt.bz2" in all_mapping_files:
        c_default_names["uniref50"]=Names(os.path.join(config.utility_mapping_database,"map_uniref50_name.txt.bz2"))
        larger_mapping_files_found=True
    if "map_uniref90_name.txt.bz2" in all_mapping_files:
        c_default_names["uniref90"]=Names(os.path.join(config.utility_mapping_database,"map_uniref90_name.txt.bz2"))
        larger_mapping_files_found=True

    if not larger_mapping_files_found:
        description+="""

    For additional name mapping files, run the following command:
    $ humann_databases --download utility_mapping full $DIR

franzosa · January 24, 2023, 6:40pm

Thanks for pointing this out - we’ll look into it. In the meantime you should be able to manually point the renaming script at the proper file.
