FWD: Optimizing MetaWIBELE run

Forwarding a question from a user:

In a nutshell, I am trying to figure out the best way to optimize my run. I can successfully preprocess 1 sample but it takes ~2 hrs and 20 min with threads=60. I have 200 samples, which would take nearly 3 weeks to preprocess one after another (and then there are 2 more steps).

  • FYI, samtools sorting and indexing takes ~60 min to run on a single sample. In the .mapping.stdout.log file I don’t see --threads appended to sort/index like this: samtools sort --threads INT, samtools index --threads INT

Header of my submission file is below along with the metawibele command:

#!/usr/bin/env bash
#$ -e ./logs/
#$ -o ./logs/
#$ -S /bin/bash
#$ -l mem_free=64G
#$ -pe smp 60
metawibele preprocess --input $INPUT --output $OUTPUT --output-basename ${OUTPUT_BASENAME} --extension-paired ${EXTENSION_PAIRED} --extension ".fastq.gz" --config $CONFIG [--threads THREADS] [--local-jobs JOBS] [--grid-jobs GRID_JOBS] [--grid sge] [--grid-partition <this would be `q` on our SGE cluster>]

Here is what I noticed when running metawibele preprocess with or without the grid options:

  • When no grid options are set (default settings and with threads=60) the preprocess step works fine but only when running a single sample. It crashes when there are 200 samples inside the $INPUT folder and it produces files like: core.32134

  • When running a single sample with grid options set (--grid-jobs 60 --grid sge), it runs for 2 hours but nothing is saved once the run is over. Not a single bit of information, not even inside the usual .e and .o log files.

  • When I run the “dry run” (--grid-jobs 60 --grid sge --dry-run) everything is output as it should be: the (empty) output folders are created, and a log file with the commands that would normally be executed is created.

I could use job arrays to process the samples individually, but then I am not sure at the moment how to go about the merging step.

I am using the following version:

$ metawibele preprocess --version
preprocess.py v0.4.5

Hi there,

Thanks for reaching out. You can definitely take advantage of the parallel options in MetaWIBELE to accelerate your run:

  1. If you run MetaWIBELE without grid options, you can set the --local-jobs parameter to process multiple samples in parallel on your local machine. For example, with “--local-jobs 20 --threads 10”, MetaWIBELE launches 20 jobs and processes 20 samples simultaneously, using 10 threads per sample.
  2. If you run with grid options, you typically need to set --grid-jobs (default: 0), --grid (default: slurm), --grid-partition (default: serial_requeue,serial_requeue,240) and --grid-scratch (e.g. --grid-scratch “”). If you saw no output from your run, your job was probably not submitted successfully to the grid compute server. Some of the grid parameters may not be set correctly, so you may need to adjust them for your grid computing environment.
  3. A “dry run” only prints the tasks that would be run without executing them, so it cannot tell you whether your jobs would be submitted and run successfully on the grid machine.
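
As a rough illustration of point 1, the sketch below splits a fixed core budget between concurrent samples and threads per sample, then prints the resulting command without running it. The 60 slots and 200 samples come from the question above; 10 threads per sample is only an illustrative choice, not a tested recommendation. Note that the “--local-jobs 20 --threads 10” example would need 200 cores, so both values should be scaled to the cores actually available.

```bash
#!/usr/bin/env bash
# Sketch: choose --local-jobs so that local-jobs x threads stays within the core budget.
# All three values below are assumptions for illustration.
TOTAL_SLOTS=60          # matches "#$ -pe smp 60"; inside an SGE job, $NSLOTS holds this number
THREADS_PER_SAMPLE=10   # illustrative; the best value depends on how each step scales
N_SAMPLES=200

# Integer division: samples that can run at once without oversubscribing cores.
LOCAL_JOBS=$(( TOTAL_SLOTS / THREADS_PER_SAMPLE ))

# Ceiling division: number of successive batches needed to cover all samples.
WAVES=$(( (N_SAMPLES + LOCAL_JOBS - 1) / LOCAL_JOBS ))

echo "local-jobs=${LOCAL_JOBS} threads=${THREADS_PER_SAMPLE} waves=${WAVES}"

# Build the command as text only; the \$VARS stay unexpanded so it can be pasted
# into the existing submission script.
CMD="metawibele preprocess --input \$INPUT --output \$OUTPUT --output-basename \$OUTPUT_BASENAME --extension-paired \$EXTENSION_PAIRED --extension .fastq.gz --config \$CONFIG --threads ${THREADS_PER_SAMPLE} --local-jobs ${LOCAL_JOBS}"
echo "$CMD"
```

With these numbers, 6 samples run at a time in 34 batches. Whether that beats one sample at a time with 60 threads depends on how well the individual steps use extra threads, so it is worth timing a small subset first.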

Meanwhile, thanks for your suggestion on the thread options for samtools. We have integrated more parallel options for samtools processing in our latest development version (v0.4.7), which has been released on GitHub and will be released on conda/pip/docker soon.
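
For reference, this is roughly what threaded sorting and indexing look like when samtools is run by hand. The -@ option (long form --threads for sort) is samtools' own thread setting; the file names and the thread count here are hypothetical, and this is not necessarily the exact command line that MetaWIBELE v0.4.7 generates.

```bash
#!/usr/bin/env bash
# Sketch: samtools sort/index with extra threads, run outside MetaWIBELE.
THREADS=8                        # illustrative value
IN_BAM="sample.bam"              # hypothetical input file
SORTED_BAM="sample.sorted.bam"   # hypothetical output file

# Keep the commands as text so they can be inspected before anything runs.
SORT_CMD="samtools sort -@ ${THREADS} -o ${SORTED_BAM} ${IN_BAM}"
INDEX_CMD="samtools index -@ ${THREADS} ${SORTED_BAM}"
echo "$SORT_CMD"
echo "$INDEX_CMD"

# Execute only when samtools is installed and the input file actually exists.
if command -v samtools >/dev/null 2>&1 && [ -f "$IN_BAM" ]; then
    samtools sort -@ "$THREADS" -o "$SORTED_BAM" "$IN_BAM"
    samtools index -@ "$THREADS" "$SORTED_BAM"
fi
```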

Thanks!
Yancong