The bioBakery help forum

Alternate kneaddata genomes in biobakery workflows

It appears that the kneaddata step of the wmgx or wmgx_wmtx workflows can’t be run against non-human databases. Is that correct? If I use --qc-options="--reference-db /location/" to specify the mouse database instead of the human database (as I would when running kneaddata directly), I get an error that says:

"Unable to find database KNEADDATA_DB_HUMAN_GENOME. This is the KneadData bowtie2 database of the human genome. This database can be downloaded with KneadData. Unable to find in default install folders or with environment variable."

I also get an error if I use a workflows environment variable to specify the mouse genome instead of using the kneaddata variable.

Is there any way to run these pipelines using the mouse genome in the KneadData step?

Hello - You should be able to use non-human reference databases with the workflow. We have used this option for our own runs when needed. Sorry to hear you are running into errors. If you add the option --contaminate-databases <folder> when running the workflows, you can specify the mouse genome location. Alternatively, you should be able to change the environment variable too. Please try adding the database option when running, and follow up if you continue to have issues.

Thank you,
Lauren

Hi and thanks for the quick reply. I tried that with the following command:

biobakery_workflows wmgx_wmtx --dry-run --contaminate-databases /kneaddata/databases/mouse/ --threads 48 --input-metagenome /metagenomics/ --input-metatranscriptome /metatranscriptomics/ --output /workflows_output/

Here was my error:

wmgx_wmtx.py: error: unrecognized arguments: --contaminate-databases /kneaddata/databases/mouse/

Hi - Thanks for the follow-up and sorry for any confusion. The wmgx workflow is the only one with the --contaminate-databases option. The wmgx_wmtx workflow does not have the database option, in part because we set different databases for each input type (wmgx and wmtx). We could add a set of custom database options to the workflow and will look at adding this in a future release. For now, you should be able to set the following environment variables to provide custom databases:

$KNEADDATA_DB_HUMAN_GENOME
$KNEADDATA_DB_HUMAN_TRANSCRIPTOME
$KNEADDATA_DB_RIBOSOMAL_RNA

The workflow will run the first database on all DNA samples and all databases on all RNA samples.
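
As a minimal sketch of that setup in bash: the mouse genome folder is the one from the command earlier in this thread, while the transcriptome and ribosomal RNA folders are hypothetical placeholders, so substitute the locations of your own bowtie2 databases. The dry run only starts if biobakery_workflows is found on your PATH.

```shell
# Point the workflow's database variables at custom (here: mouse) databases.
# The variable names keep "HUMAN" in them, but they can hold any database folder.
export KNEADDATA_DB_HUMAN_GENOME=/kneaddata/databases/mouse/
# Hypothetical folders: replace with your own transcriptome and rRNA databases.
export KNEADDATA_DB_HUMAN_TRANSCRIPTOME=/kneaddata/databases/mouse_transcriptome/
export KNEADDATA_DB_RIBOSOMAL_RNA=/kneaddata/databases/ribosomal_rna/

# Same command as above, without --contaminate-databases (wmgx_wmtx rejects it).
if command -v biobakery_workflows >/dev/null 2>&1; then
    biobakery_workflows wmgx_wmtx --dry-run --threads 48 \
        --input-metagenome /metagenomics/ \
        --input-metatranscriptome /metatranscriptomics/ \
        --output /workflows_output/
fi
```

The variables only need to be exported in the same shell session that launches the workflow.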

Thank you,
Lauren

Hi, I used the following command: "biobakery_workflows wmgx --contaminate-databases /apps/users/user01/wanghhh/metagenomic/databases/kneaddata_database --input rawdata --output workflow_output", but I still get this error: "Unable to find database KNEADDATA_DB_HUMAN_GENOME. This is the KneadData bowtie2 database of the human genome. This database can be downloaded with KneadData. Unable to find in default install folders or with environment variables."
How can I fix it? Thank you.

Hello - Thank you for the detailed post. Setting the environment variable $KNEADDATA_DB_HUMAN_GENOME will resolve the error you are seeing. You can set it to any database; it just needs to be set as a part of the initial workflow installation.

Thank you,
Lauren

Hi, would you tell me how to set the environment variables? For example, I have downloaded the corresponding databases to /lizhihua/biobakery_workflows/kneaddata_db_human_genome. Would it then be $KNEADDATA_DB_HUMAN_GENOME=/lizhihua/biobakery_workflows/kneaddata_db_human_genome? Thank you very much!

Hello - It depends on your default shell. If you are running bash, you would run:
$ export KNEADDATA_DB_HUMAN_GENOME=/lizhihua/biobakery_workflows/kneaddata_db_human_genome
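
If it helps, here is a small sketch for confirming that the variable is set and points at an existing folder before starting a run. The helper name check_db_var is made up for this example and is not part of KneadData or the workflows; it only checks that the folder exists, not that it contains a valid bowtie2 index.

```shell
# Hypothetical helper (not part of bioBakery): report whether the environment
# variable named in $1 is set and points at an existing folder.
check_db_var() {
    name=$1
    # Look up the value of the variable whose name is stored in $name.
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
        echo "$name is not set"
        return 1
    fi
    if [ ! -d "$value" ]; then
        echo "$name points to a missing folder: $value"
        return 1
    fi
    echo "$name=$value"
}

export KNEADDATA_DB_HUMAN_GENOME=/lizhihua/biobakery_workflows/kneaddata_db_human_genome
check_db_var KNEADDATA_DB_HUMAN_GENOME || echo "Fix the path before running the workflow."
```

To keep the setting across sessions in bash, the same export line can be added to ~/.bashrc.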

Thank you,
Lauren