Humann3 creating many bowtie2 index files in the temp dir

Hi there,

I have been running Humann3 and all seems to have been working great until about a week ago (who knows what the naughty coding fairies must have done :man_fairy:).

Humann 3 is running, but now takes super long and generates multiple Bowtie 2 index files. This is how I have been running the Humann portion of my batch jobs:

humann --protein-database /projects/emye7956/software/anaconda/envs/humann_env/uniref \
--nucleotide-database /projects/emye7956/software/anaconda/envs/humann_env/chocophlan/ \
--input "$fpathc" \
--output "$output_dir" -v && echo "ALL DONE WITH ${foutput} AT LAST :D"

The metaphlan databases I have are in /projects/emye7956/software/anaconda/envs/humann_env/lib/python3.7/site-packages/metaphlan/metaphlan_databases
And look like this:

mpa_latest                              mpa_vOct22_CHOCOPhlAnSGB_202212.pkl
mpa_vOct22_CHOCOPhlAnSGB_202212.1.bt2l  mpa_vOct22_CHOCOPhlAnSGB_202212.rev.1.bt2l
mpa_vOct22_CHOCOPhlAnSGB_202212.2.bt2l  mpa_vOct22_CHOCOPhlAnSGB_202212.rev.2.bt2l
mpa_vOct22_CHOCOPhlAnSGB_202212.3.bt2l  mpa_vOct22_CHOCOPhlAnSGB_202212_VINFO.csv
mpa_vOct22_CHOCOPhlAnSGB_202212.4.bt2l  README.txt
mpa_vOct22_CHOCOPhlAnSGB_202212.fna

An example of an output temp dir for a file that ran to completion but took half a day looks like this (note the multiple bowtie2 index files, which take a long time to build):

MG773_humann_temp:
MG773_bowtie2_aligned.sam
MG773_bowtie2_aligned.tsv
MG773_bowtie2_index.1.bt2
MG773_bowtie2_index.2.bt2
MG773_bowtie2_index.3.bt2
MG773_bowtie2_index.4.bt2
MG773_bowtie2_index.rev.1.bt2
MG773_bowtie2_index.rev.2.bt2
MG773_custom_chocophlan_database.ffn
MG773_cleancombined.log
MG773_metaphlan_bowtie2.txt
MG773_metaphlan_bugs_list.tsv

And my config file looks like this:

[database_folders]
nucleotide = data/chocophlan_DEMO
protein = data/uniref_DEMO
utility_mapping = data/misc

[run_modes]
resume = True
verbose = False
bypass_prescreen = False
bypass_nucleotide_index = False
bypass_nucleotide_search = False
bypass_translated_search = False
threads = 40

[alignment_settings]
evalue_threshold = 1.0
prescreen_threshold = 0.01
translated_subject_coverage_threshold = 50.0
translated_query_coverage_threshold = 90.0
nucleotide_subject_coverage_threshold = 50.0
nucleotide_query_coverage_threshold = 90.0

[output_format]
output_max_decimals = 10
remove_stratified_output = False
remove_column_description_output = False

Any help would be very much appreciated!
Thanks so much in advance :slight_smile:

Having multiple bowtie2 index files is normal (the index is split over 6 files). Samples that have a lower percentage of reads mapped to known pangenomes (i.e. samples that do more work in translated search) will take longer to run, as will larger samples. Maybe this sample was just “weird” in one of those senses?
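
If it helps, here is a rough shell sketch for checking both points on a finished sample. The function names `count_bt2_parts` and `show_unaligned` are made up for illustration and are not part of HUMAnN. The first counts the index parts using the file names from the temp dir listing above. The second assumes the HUMAnN log (e.g. `MG773_cleancombined.log`) contains lines starting with "Unaligned reads after", which is worth confirming against your own log before relying on it.

```shell
#!/bin/sh
# Hypothetical helpers (not part of HUMAnN) for inspecting a HUMAnN temp dir.

# Count the Bowtie 2 index parts for one sample.
# A complete index of this kind has 6 parts:
# .1.bt2 to .4.bt2 plus .rev.1.bt2 and .rev.2.bt2.
count_bt2_parts() {
    dir=$1      # e.g. MG773_humann_temp
    prefix=$2   # e.g. MG773_bowtie2_index
    n=0
    for f in "$dir/$prefix".[1-4].bt2 "$dir/$prefix".rev.[12].bt2; do
        # An unmatched glob stays literal, so test that the file really exists.
        if [ -e "$f" ]; then
            n=$((n + 1))
        fi
    done
    echo "$n"
}

# Print the lines reporting how many reads were left unaligned.
# ASSUMPTION: the log uses the wording "Unaligned reads after ...".
show_unaligned() {
    log=$1      # e.g. MG773_humann_temp/MG773_cleancombined.log
    grep "Unaligned reads after" "$log" || echo "no unaligned-read lines found in $log"
}
```

For the listing in the question, `count_bt2_parts MG773_humann_temp MG773_bowtie2_index` would print 6, i.e. one complete index rather than several. If `show_unaligned MG773_humann_temp/MG773_cleancombined.log` reports a high unaligned percentage after nucleotide alignment, most reads went on to translated search, which would fit the long runtime.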