Biobakery Workflows: Errors with downloading database, "unpickling error", "topological_sort()" error, and "Unable to find Trimmomatic" error

Hi there! I hope you are well, and hope you might be able to help me with some errors. I am trying to install and run biobakery_workflows, but there are two problems that I’m facing.

Context (how I installed biobakery_workflows):

conda create -n biobakerywf -c biobakery biobakery_workflows

Problem 1: Trouble downloading and installing databases with biobakery_workflows_databases

When I first tried to install the databases with biobakery_workflows_databases, I received a tbb error:

error while loading shared libraries: libtbb.so.2: cannot open shared object file: No such file or directory
(ERR): Description of arguments failed!

This was not too big of a deal, because it’s a known issue with bowtie2. I resolved it by downgrading tbb:

conda install tbb=2020.2

After these changes, I again tried to install databases with:

biobakery_workflows_databases --install wmgx --location /home/bsingh/bin/biobakery_databases

I received the following error:

Installing humann utility mapping database
Download URL: http://huttenhower.sph.harvard.edu/humann2_data/full_mapping_v201901.tar.gz
Downloading file of size: 2.55 GB

2.55 GB 100.00 %  10.70 MB/sec  0 min -0 sec         
Extracting: /home/bsingh/bin/biobakery_databases/humann/full_mapping_v201901.tar.gz

Database installed: /home/bsingh/bin/biobakery_databases/humann/utility_mapping

HUMAnN configuration file updated: database_folders : utility_mapping = /home/bsingh/bin/biobakery_databases/humann/utility_mapping
Generating strainphlan fasta database
Could not locate a Bowtie index corresponding to basename "/home/bsingh/miniconda3/envs/biobakerywf/lib/python3.7/site-packages/metaphlan/metaphlan_databases/mpa_v30_CHOCOPhlAn_201901"
Error: Encountered internal Bowtie 2 exception (#1)
Command: /home/bsingh/miniconda3/envs/biobakerywf/bin/bowtie2-inspect-s --wrapper basic-0 /home/bsingh/miniconda3/envs/biobakerywf/lib/python3.7/site-packages/metaphlan/metaphlan_databases/mpa_v30_CHOCOPhlAn_201901 
Unable to install database. Error running command: bowtie2-inspect /home/bsingh/miniconda3/envs/biobakerywf/lib/python3.7/site-packages/metaphlan/metaphlan_databases/mpa_v30_CHOCOPhlAn_201901 > /home/bsingh/bin/biobakery_databases/strainphlan_db_markers/all_markers.fasta

Any help would be appreciated!

Problem 2: "Unpickling Error"

Even when I just run biobakery_workflows wmgx --help, I get the following error:

  File "/home/bsingh/miniconda3/envs/biobakerywf/bin/wmgx.py", line 41, in <module>
    workflow = Workflow(version="0.1", description="A workflow for whole metagenome shotgun sequences")
  File "/home/bsingh/miniconda3/envs/biobakerywf/lib/python3.7/site-packages/anadama2/workflow.py", line 120, in __init__
    self.document=PweaveDocument()
  File "/home/bsingh/miniconda3/envs/biobakerywf/lib/python3.7/site-packages/anadama2/document.py", line 96, in __init__
    self.vars=self.get_vars()
  File "/home/bsingh/miniconda3/envs/biobakerywf/lib/python3.7/site-packages/anadama2/document.py", line 325, in get_vars
    vars = pickle.load(open(pickle_file[0],"rb"))
_pickle.UnpicklingError: pickle data was truncated

I think this error is separate from the database download errors. For this one, I’m truly lost. Once again, any help would be appreciated!
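
For anyone debugging the same traceback: "pickle data was truncated" means the file passed to pickle.load() ends before the pickle stream does. Below is a minimal sketch for testing a suspect file. The function name and the idea of checking the file yourself are my own; the traceback does not show which file anadama2 was reading, so the path is left as a parameter. Only unpickle files you trust, since loading a pickle can execute code.

```python
import pickle


def pickle_is_readable(path):
    """Return True if the file at `path` unpickles cleanly, False if it is truncated or empty."""
    try:
        with open(path, "rb") as handle:
            pickle.load(handle)
    except (pickle.UnpicklingError, EOFError):
        # A stream cut off part-way raises UnpicklingError ("pickle data was truncated");
        # an empty file raises EOFError.
        return False
    return True
```

If this returns False for a file the workflow depends on, that file was probably written incompletely (for example by an interrupted install) and needs to be regenerated.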

Update: I was able to solve Problem 1 and Problem 2.

Problem 1 Solution: Before using biobakery_workflows_databases to download the other databases, it’s best to download the MetaPhlAn database manually. Somewhere on the MetaPhlAn GitHub or tutorial, it says that if you’re installing with Conda, you should download the databases to a custom location. However, if you do this, biobakery_workflows_databases doesn’t know where to look for the MetaPhlAn databases and assumes they don’t exist, which leads to the error. So when you download the MetaPhlAn databases, put them in the default location inside the Conda file structure.
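
A quick way to confirm whether the index is really where bowtie2-inspect looks is to check for the index files next to the basename from the error message. This is only a sketch: the helper name is made up, and it relies on the standard Bowtie 2 naming (six files ending in .bt2, or .bt2l for large indexes).

```python
import os

# Parts Bowtie 2 writes for one index; each is followed by ".bt2" (or ".bt2l" for large indexes).
BT2_PARTS = ("1", "2", "3", "4", "rev.1", "rev.2")


def missing_bowtie2_index_files(basename):
    """Return the index files expected for `basename` that are not on disk.

    An empty list means a complete small (.bt2) or large (.bt2l) index exists.
    """
    candidates = []
    for ext in ("bt2", "bt2l"):
        expected = ["{}.{}.{}".format(basename, part, ext) for part in BT2_PARTS]
        candidates.append([path for path in expected if not os.path.isfile(path)])
    # Report against whichever index flavour is closest to complete.
    return min(candidates, key=len)
```

Calling it with the basename from the error (the path ending in metaphlan_databases/mpa_v30_CHOCOPhlAn_201901) lists what is missing from the default location.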

Problem 2 Solution: This resolved itself once all the databases were downloaded.

Problem 3 Solution: There had been a third problem, where I was getting a “topological_sort()” error. As per this forum, I followed the solution and downgraded the networkx package to version 1.11.

Ultimately, this is what has worked so far to solve problems 1-3:

conda create -n biobakerywf -c biobakery biobakery_workflows
conda activate biobakerywf
conda install tbb=2020.2
conda install networkx=1.11
metaphlan --install # do not specify download location
biobakery_workflows_databases --install wmgx # do not specify download location

Problem 4:

However, I unfortunately now have another problem, where KneadData is unable to recognize that Trimmomatic is already downloaded, and I get the following error when I try to run the program:

Task 3 failed
  Name: kneaddata____HD42R4_subsample
  Original error: 
  Error executing action 0. Original Exception: 
  Traceback (most recent call last):
    File "/home/bsingh/miniconda3/envs/biobakerywf/lib/python3.7/site-packages/anadama2/runners.py", line 201, in _run_task_locally
      action_func(task)
    File "/home/bsingh/miniconda3/envs/biobakerywf/lib/python3.7/site-packages/anadama2/helpers.py", line 89, in actually_sh
      ret = _sh(s, **kwargs)
    File "/home/bsingh/miniconda3/envs/biobakerywf/lib/python3.7/site-packages/anadama2/util/__init__.py", line 320, in sh
      raise ShellException(proc.returncode, msg.format(cmd, ret[0], ret[1]))
  anadama2.util.ShellException: [Errno 1] Command `kneaddata --input /home/bsingh/biobakery_test_inputs/HD42R4_subsample.fastq.gz --output /home/bsingh/output_data/kneaddata/main --threads 1 --output-prefix HD42R4_subsample   --reference-db /home/bsingh/biobakery_workflows_databases/kneaddata_db_human_genome  --serial --run-trf  && mv /home/bsingh/output_data/kneaddata/main/HD42R4_subsample.repeats.removed.fastq /home/bsingh/output_data/kneaddata/main/HD42R4_subsample.fastq' failed. 
  Out: b''
  Err: b'ERROR: Unable to find trimmomatic. Please provide the full path to trimmomatic with --trimmomatic.\n

This same error was previously observed in this forum post, and also in this GitHub issue. A similar problem with trf was observed here.

As per the links above, I figured that the solution was to just specify the Trimmomatic path with --trimmomatic when running KneadData. However, since I’m using biobakery_workflows instead of just KneadData, I don’t think there is an explicit option to do that?

Problem 5:

I tried using KneadData by itself to make sure that there weren’t any other issues, but even when I specify the Trimmomatic path, I get this error, which seems to originate from here, at line 278:

Decompressing gzipped file ...
Critical Error: Unable to gunzip input file: /home/bsingh/biobakery_test_inputs/HD42R4_subsample.fastq.gz

I’m sorry for the multiple replies! As I’m finding solutions, I thought it’s better to just post it here in case anyone else finds it useful.

Problem 4 Solution: Fixed the “Unable to find trimmomatic” error. There is a gem of an option called --qc-options, where I was able to put the Trimmomatic path! Yay!

biobakery_workflows wmgx --input /home/bsingh/biobakery_test_inputs/ --output output_data --bypass-strain-profiling --qc-options "--trimmomatic /home/bsingh/bin/Trimmomatic-0.36"
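
If you are not sure where Trimmomatic lives on your machine, something like the following can locate it and build the option string. It is a rough sketch with made-up helper names, and it assumes the jar file is named trimmomatic*.jar, which may not hold for every install.

```python
import fnmatch
import os


def find_trimmomatic_dirs(search_root):
    """Return sorted directories under `search_root` that contain a trimmomatic*.jar file."""
    hits = set()
    for dirpath, _dirnames, filenames in os.walk(search_root):
        # Assumption: the jar is named like trimmomatic-0.36.jar or trimmomatic.jar.
        if any(fnmatch.fnmatch(name.lower(), "trimmomatic*.jar") for name in filenames):
            hits.add(dirpath)
    return sorted(hits)


def qc_options_for(trimmomatic_dir):
    """Build the value passed to --qc-options, in the same form as the command above."""
    return "--trimmomatic {}".format(trimmomatic_dir)
```

For example, search your home directory or the Conda environment folder, pick one of the returned directories, and quote the result of qc_options_for() after --qc-options.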

Problem 5 Solution: File was corrupted! We’re all good! Thank you!
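
In case it saves someone else the time, a corrupted or truncated .gz file can be spotted before running KneadData by reading it to the end. The sketch below uses only the Python standard library; the function name is hypothetical.

```python
import gzip
import zlib


def gzip_is_intact(path, chunk_size=1 << 20):
    """Return True if the gzip file at `path` decompresses fully, False otherwise."""
    try:
        with gzip.open(path, "rb") as handle:
            # Read to the end so the trailing CRC and length checks are exercised.
            while handle.read(chunk_size):
                pass
    except (OSError, EOFError, zlib.error):
        # OSError covers a bad header, CRC mismatch, or unreadable file;
        # EOFError means the stream stops early; zlib.error means corrupt compressed data.
        return False
    return True
```

A False result for an input FASTQ means the file should be downloaded or copied again before retrying the workflow.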


Hi, I’m so sorry for making another post.

Problem 6:

I was able to run a wmgx workflow test with one full-sized paired-end metagenomic sample. However, when I tried wmgx_vis, I got the following error:

ImportError: cannot import name 'PwebProcessor'

I found over here that other people have also had this problem, and that they solved this by changing the Pweave version to 0.25. My version was 0.30.2. So I decided to do the same thing, and re-created my environment like this:

conda create -c biobakery -n biobakery python=3.6 biobakery_workflows tbb=2020.2 networkx=1.11 pweave=0.25 python-leveldb

I again tried to run wmgx_vis with the following command:

biobakery_workflows wmgx_vis --input /home/bsingh/output_data/ --project-name JSA10 --output output_vis_flow --format pdf

And I got this error in the log file:

  File "/home/bsingh/miniconda3/envs/biobakerywf/lib/python3.6/site-packages/anadama2/runners.py", line 201, in _run_task_locally
    action_func(task)
  File "/home/bsingh/miniconda3/envs/biobakerywf/lib/python3.6/site-packages/anadama2/document.py", line 286, in create
    doc.weave(shell=PwebProcessorSpaces)
  File "/home/bsingh/miniconda3/envs/biobakerywf/lib/python3.6/site-packages/pweave/pweb.py", line 198, in weave
    self.run(shell)
  File "/home/bsingh/miniconda3/envs/biobakerywf/lib/python3.6/site-packages/pweave/pweb.py", line 149, in run
    runner.run()
  File "/home/bsingh/miniconda3/envs/biobakerywf/lib/python3.6/site-packages/pweave/processors.py", line 65, in run
    self.executed = list(map(self._runcode, self.parsed))
  File "/home/bsingh/miniconda3/envs/biobakerywf/lib/python3.6/site-packages/pweave/processors.py", line 131, in _runcode
    chunk['content'] = self.loadinline(chunk['content'])
  File "/home/bsingh/miniconda3/envs/biobakerywf/lib/python3.6/site-packages/anadama2/document.py", line 234, in loadinline
    result = self.loadstring(code_str).lstrip().replace("\n","",1)
  File "/home/bsingh/miniconda3/envs/biobakerywf/lib/python3.6/site-packages/pweave/processors.py", line 296, in loadstring
    exec(compiled, scope)
  File "chunk", line 1, in <module>
NameError: name 'caption' is not defined

The problem seems to be with Pweave again. I’m not sure which version I should be using?

Update: Fixed problem 6! It seemed to be more dependency issues.

Problem 6 Solution:

This is the conda env that has worked for me so far for wmgx and wmgx_vis:

conda env export --from-history > biobakerywl.yaml

channels:
  - biobakery
  - conda-forge
  - bioconda
  - defaults
dependencies:
  - tbb=2020.2
  - pweave=0.25
  - python=3.6
  - python-leveldb
  - networkx=1.11
  - biobakery_workflows
  - jupyter_client
  - pandoc
  - hclust2
  - latexcodec
  - r
  - r-vegan
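
Since most of the problems in this thread came down to package versions, here is a small sketch for checking the Python-level pins (networkx 1.11 and Pweave 0.25) from inside the environment. The helper names are made up, and the importlib_metadata fallback assumes that backport is installed on Python 3.6/3.7. tbb is not a Python package, so check it with conda list instead.

```python
def pin_mismatches(installed, pins):
    """Return {name: (installed_version, pinned_version)} for packages that miss their pin.

    A pin like "0.25" accepts "0.25" and "0.25.x", roughly like `pweave=0.25` in conda.
    """
    problems = {}
    for name, pinned in pins.items():
        found = installed.get(name)
        if found is None or not (found == pinned or found.startswith(pinned + ".")):
            problems[name] = (found, pinned)
    return problems


def installed_python_packages(names):
    """Look up installed versions of Python distributions; None when not installed."""
    try:
        from importlib import metadata  # standard library from Python 3.8
    except ImportError:
        import importlib_metadata as metadata  # backport, assumed installed on 3.6/3.7
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions
```

Usage would be along the lines of pin_mismatches(installed_python_packages(["networkx", "Pweave"]), {"networkx": "1.11", "Pweave": "0.25"}); an empty dict means both pins are satisfied.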

Hello,
Thank you very much for all the details and explanations on resolving these issues. I have the same problem and will try a new conda install using your YAML file. Does your conda env still work, or did you have to make any updates?
Thanks
regards
Nicolas

So I couldn’t get this to work with the YAML but this worked for me after investigating the file:

conda create --name biobakery_p3.6 python=3.6
conda activate biobakery_p3.6
conda install -c biobakery -c conda-forge -c bioconda pweave=0.25
conda install -c biobakery -c conda-forge -c bioconda networkx=1.11
conda install -c biobakery -c conda-forge -c bioconda biobakery_workflows
conda install -c biobakery -c conda-forge -c bioconda tbb=2020.2 python-leveldb jupyter_client pandoc hclust2 latexcodec r r-vegan

Now I get errors which seem to originate from matplotlib:

 Exception:
  'dict_keys' object has no attribute 'remove'
  Error messages will be included in output document
Traceback (most recent call last):
  File "/usr/local/bin/miniconda3/envs/biobakery_p3.6/bin/wmgx_vis.py", line 133, in <module>
    workflow.go()
  File "/usr/local/bin/miniconda3/envs/biobakery_p3.6/lib/python3.6/site-packages/anadama2/workflow.py", line 800, in go
    self._handle_finished()
  File "/usr/local/bin/miniconda3/envs/biobakery_p3.6/lib/python3.6/site-packages/anadama2/workflow.py", line 832, in _handle_finished
    raise RunFailed()
anadama2.workflow.RunFailed
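
For context on the message itself: 'dict_keys' object has no attribute 'remove' is what Python 3 raises when Python 2 style code calls .remove() directly on the result of dict.keys(). The log above does not show which library does this, so the snippet below is only a generic illustration with made-up function names, not the actual failing code.

```python
def drop_key_py2_style(mapping, unwanted):
    """Python 2 idiom: fails on Python 3 because keys() returns a view, not a list."""
    keys = mapping.keys()
    keys.remove(unwanted)  # raises AttributeError on Python 3
    return keys


def drop_key(mapping, unwanted):
    """Same idea, working on Python 3: copy the view into a list before removing."""
    keys = list(mapping.keys())
    keys.remove(unwanted)
    return keys
```

If the failing call can be traced to a specific package, a version of that package written for Python 3 would be the thing to look for.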

Any help, pointers or anything would all be very appreciated :).