Biobakery Workflow database

Hi, so I’m having a bit of an issue using biobakery workflow. The main issue is that the database (for wmgx) is stored in a different path for memory-related reasons. How do I make biobakery workflow use the database stored elsewhere?

Hello, you can set environment variables to point to databases that are in a custom location.

If all databases are under one custom folder, set the variable $BIOBAKERY_WORKFLOWS_DATABASES.
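A minimal sketch of the single-folder case in a bash-like shell. The path below is a placeholder assumed for illustration, not a default location:

```shell
# Hypothetical location; replace with the folder that holds all your databases.
export BIOBAKERY_WORKFLOWS_DATABASES="/path/to/biobakery_workflows_databases"

# Confirm the variable is visible in the shell that will launch the workflow.
echo "$BIOBAKERY_WORKFLOWS_DATABASES"
```

The export only lasts for the current shell session, so it would need to be repeated (or added to a shell startup file) for later runs.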

Alternatively if you have multiple custom folders, then set one or more of the environment variables:
$KNEADDATA_DB_HUMAN_GENOME, $KNEADDATA_DB_RIBOSOMAL_RNA, $KNEADDATA_DB_HUMAN_TRANSCRIPTOME, $STRAINPHLAN_DB_REFERENCE, and $STRAINPHLAN_DB_MARKERS.
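For the multiple-folder case, a sketch along the same lines. Every path here is a made-up placeholder, and only the variables that apply to your setup need to be set:

```shell
# Hypothetical paths; each one points at the folder holding that database.
export KNEADDATA_DB_HUMAN_GENOME="/data/kneaddata/human_genome"
export KNEADDATA_DB_RIBOSOMAL_RNA="/data/kneaddata/ribosomal_rna"
export KNEADDATA_DB_HUMAN_TRANSCRIPTOME="/data/kneaddata/human_transcriptome"
export STRAINPHLAN_DB_REFERENCE="/data/strainphlan/reference"
export STRAINPHLAN_DB_MARKERS="/data/strainphlan/markers"

# List the database variables currently set, as a quick sanity check.
env | grep -E '^(KNEADDATA|STRAINPHLAN)_DB_'
```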

If your HUMAnN databases are stored in a custom folder, the HUMAnN configuration file gets updated when you select the folder and download the database through the HUMAnN database tool, so there is nothing you need to do. If your MetaPhlAn database is installed in a custom folder, you need to provide that location as a --metaphlan-option with the --index option when running the workflow.

Please post if you have additional questions or are still running into issues.

Thanks!
Lauren

Hi Lauren,
Just a follow-up to this.

Does running the command below install all the needed, up-to-date databases (i.e., if we were to run MetaPhlAn 4 through the workflow and wanted the updated databases)?
$ biobakery_workflows_databases --install wmgx

Or do we need to download the individual databases and provide their locations?

Hi Arya, yes, that should install all the required databases for the wmgx workflow except MetaPhlAn. The MetaPhlAn database will be installed the first time you run it. Alternatively, you can install that database by running $ metaphlan --install.

Thanks!
Lauren

Hi Lauren,

Following up on this thread — I manually downloaded the databases and organized them under a single parent directory (biobakery_workflows_databases/) with the following structure:

I’m planning to export the $BIOBAKERY_WORKFLOWS_DATABASES environment variable pointing to this parent directory. For MetaPhlAn, I have multiple database versions (v20 and vJan21). For KneadData, I’m using a mouse reference (mouse_C57BL) rather than the human genome, and it is in its own subdirectory.

My question:

Is this folder structure what the workflow expects when using $BIOBAKERY_WORKFLOWS_DATABASES, or do the subdirectory names need to match a specific convention? Or, in this case, is exporting separate environment variables the only way?

Thanks!