Uniref50s in humann2 output when using "--search-mode uniref90"

I am running a number of shotgun metagenomic samples through a custom pipeline that includes humann2. The command for the primary humann2 run is:

humann2 --input ${sample}.fastq.gz --taxonomic-profile ${sample}_profile.tsv --output $humann2_output --threads 8 --remove-temp-output --search-mode uniref90 --output-basename $sample

I have other steps in the workflow to regroup and rename, but all of my initial ${sample}_genefamilies.tsv outputs contain quite a few UniRef50 rows, including some that have names (e.g. UniRef50_K1TBF9: Transposase (Fragment) 1059.7162421954).

A typical file has ~400k rows, ~50k of which are UniRef50s, and half of those UniRef50s have names.
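Counts like these can be reproduced with a rough sketch along the following lines. It assumes the usual tab-separated genefamilies layout with the feature ID in the first column and a header line starting with "#"; the function name count_uniref_rows is hypothetical and not part of humann2.

```shell
# Count rows by UniRef prefix in a HUMAnN2 genefamilies table.
# Assumption: column 1 holds the feature ID, e.g.
#   "UniRef50_K1TBF9: Transposase (Fragment)" or "UniRef90_X|g__Genus.s__Species".
# Stratified (per-species) rows are counted along with the totals.
count_uniref_rows() {
    awk -F'\t' '
        /^#/              { next }    # skip the header line
        $1 ~ /^UniRef50_/ { u50++ }   # rows keyed by a UniRef50 cluster
        $1 ~ /^UniRef90_/ { u90++ }   # rows keyed by a UniRef90 cluster
        END { printf "UniRef50\t%d\nUniRef90\t%d\n", u50, u90 }
    ' "$1"
}
```

Calling count_uniref_rows ${sample}_genefamilies.tsv prints one tab-separated count per prefix.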

These persist through humann2_renorm_table, and then when I do humann2_rename_table (expecting UniRef90s), they’re all converted to e.g. UniRef50_K1TBF9: NO_NAME.

Is there something about my database configuration that might be causing this?

$ humann2_config
HUMAnN2 Configuration ( Section : Name = Value )
output_format : remove_stratified_output = False
output_format : output_max_decimals = 10
output_format : remove_column_description_output = False
alignment_settings : prescreen_threshold = 0.01
alignment_settings : translated_query_coverage_threshold = 90.0
alignment_settings : evalue_threshold = 1.0
alignment_settings : translated_subject_coverage_threshold = 50.0
database_folders : utility_mapping = /pool001/vklepacc/databases/utility_mapping/
database_folders : protein = /pool001/vklepacc/databases/uniref/
database_folders : nucleotide = /pool001/vklepacc/databases/chocophlan/
run_modes : bypass_nucleotide_search = False
run_modes : verbose = False
run_modes : resume = False
run_modes : bypass_translated_search = False
run_modes : bypass_nucleotide_index = False
run_modes : threads = 1
run_modes : bypass_prescreen = False
$ humann2_databases
HUMAnN2 Databases ( database : build = location )
utility_mapping : full = http://huttenhower.sph.harvard.edu/humann2_data/full_mapping_1_1.tar.gz
chocophlan : DEMO = http://huttenhower.sph.harvard.edu/humann2_data/chocophlan/DEMO_chocophlan.v0.1.1.tar.gz
chocophlan : full = http://huttenhower.sph.harvard.edu/humann2_data/chocophlan/full_chocophlan_plus_viral.v0.1.1.tar.gz
uniref : DEMO_diamond = http://huttenhower.sph.harvard.edu/humann2_data/uniprot/uniref_annotated/uniref90_DEMO_diamond.tar.gz
uniref : uniref90_diamond = http://huttenhower.sph.harvard.edu/humann2_data/uniprot/uniref_annotated/uniref90_annotated_1_1.tar.gz
uniref : uniref50_ec_filtered_diamond = http://huttenhower.sph.harvard.edu/humann2_data/uniprot/uniref_ec_filtered/uniref50_ec_filtered_1_1.tar.gz
uniref : uniref50_GO_filtered_rapsearch2 = http://huttenhower.sph.harvard.edu/humann2_data/uniprot/uniref50_GO_filtered/uniref50_GO_filtered_rapsearch2.tar.gz
uniref : uniref50_diamond = http://huttenhower.sph.harvard.edu/humann2_data/uniprot/uniref_annotated/uniref50_annotated_1_1.tar.gz
uniref : uniref90_ec_filtered_diamond = http://huttenhower.sph.harvard.edu/humann2_data/uniprot/uniref_ec_filtered/uniref90_ec_filtered_1_1.tar.gz


This thread from the old Google Group seems to have the answer:

If pointed at a folder with more than one database, HUMAnN2 will perform separate searches against each one and merge the results (this allows a user to break up a large database, e.g. one that is too big to fit in memory, and search it serially). That is what you are seeing here.

If you store the UniRef50 and UniRef90 databases in separate folders you will not have this issue. To tell HUMAnN2 which database to use, you can point your individual runs at a folder with the --protein-database flag, or you can configure a default translated search database with the humann2_config utility.


For example:

$ humann2_config --update database_folders protein $DIR

This would update the default protein database folder to $DIR.
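Separating the two databases might look like the sketch below. It assumes the database files in the shared folder have names beginning with "uniref50" or "uniref90" (check the actual file names first); the function name split_uniref_dbs and the subfolder layout are hypothetical.

```shell
# Move UniRef50 and UniRef90 database files out of a shared folder into
# separate subfolders so that HUMAnN2 can be pointed at only one of them.
# Assumption: database file names start with "uniref50" or "uniref90".
split_uniref_dbs() {
    src="$1"
    for level in uniref50 uniref90; do
        mkdir -p "$src/$level"
        for f in "$src"/"$level"*; do
            # Only move regular files; this skips the new subfolder itself
            # and the literal pattern when nothing matches.
            if [ -f "$f" ]; then
                mv "$f" "$src/$level/"
            fi
        done
    done
}
```

After running split_uniref_dbs on the protein database folder, $DIR in the humann2_config command above (or the folder given to --protein-database) would be the uniref90 subfolder.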