Forwarding a question from a user:
In a nutshell, I am trying to figure out the best way to optimize my run. I can successfully preprocess 1 sample, but it takes ~2 hrs and 20 min with threads=60. I have 200 samples, which would take about 19 days to preprocess one after another (and then there are 2 more steps).
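As a quick check of the sequential estimate, here is a small sketch that uses only the numbers quoted above (200 samples, roughly 140 minutes each). It assumes samples are processed strictly one after another with no overlap.

```shell
#!/usr/bin/env bash
# Back-of-the-envelope estimate from the figures quoted in the post.
samples=200
minutes_per_sample=140   # ~2 h 20 min per sample

# Integer arithmetic: results are rounded down.
total_minutes=$(( samples * minutes_per_sample ))
total_hours=$(( total_minutes / 60 ))
total_days=$(( total_hours / 24 ))

echo "sequential: ${total_hours} h (~${total_days} days)"
```

Any parallelism across samples divides this figure, which is why running samples concurrently matters more here than adding threads to a single sample.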
- FYI, samtools sorting and indexing takes ~60 min to run on a single sample. In the .mapping.stdout.log file I don’t see --threads appended to sort/index like this: samtools sort --threads INT, samtools index --threads INT
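For reference, a minimal sketch of what threaded sort and index calls look like when run by hand. The file names and thread count are hypothetical, and this says nothing about how MetaWIBELE builds these commands internally; it only uses the `-@` option that `samtools sort` and `samtools index` accept for extra threads.

```shell
#!/usr/bin/env bash
# Hypothetical file names; replace with a real unsorted BAM from the mapping step.
threads=8
in_bam="sample.bam"
sorted_bam="sample.sorted.bam"

# -@ sets the number of additional threads for samtools sort and index.
sort_cmd=(samtools sort -@ "$threads" -o "$sorted_bam" "$in_bam")
index_cmd=(samtools index -@ "$threads" "$sorted_bam")

# Print the commands; run them only if samtools and the input file are present.
printf '%s\n' "${sort_cmd[*]}" "${index_cmd[*]}"
if command -v samtools >/dev/null 2>&1 && [ -f "$in_bam" ]; then
    "${sort_cmd[@]}" && "${index_cmd[@]}"
fi
```

Timing these two commands on one sample, with and without `-@`, would show how much of the ~60 min is recoverable through threading.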
Header of my submission file is below along with the metawibele command:
#!/usr/bin/env bash
#$ -e ./logs/
#$ -o ./logs/
#$ -S /bin/bash
#$ -l mem_free=64G
#$ -pe smp 60
metawibele preprocess --input $INPUT --output $OUTPUT --output-basename ${OUTPUT_BASENAME} --extension-paired ${EXTENSION_PAIRED} --extension ".fastq.gz" --config $CONFIG [--threads THREADS] [--local-jobs JOBS] [--grid-jobs GRID_JOBS] [--grid sge] [--grid-partition <this would be `q` on our SGE cluster>]
Here is what I noticed when running metawibele preprocess with or without the grid options:
- When no grid options are set (default settings and with threads=60) the preprocess step works fine but only when running a single sample. It crashes when there are 200 samples inside the $INPUT folder and it produces files like: core.32134
- When running a single sample with grid options set (--grid-jobs 60 --grid sge), it runs for 2 hours but nothing is saved once the run is over. Not a single bit of information, not even inside the usual .e and .o log files.
- When I run the “dry run” (--grid-jobs 60 --grid sge --dry-run) everything is output as it should be: the (empty) output folders are created, and a log file with the commands that would normally be executed is created.
I could use job arrays to process the samples individually, but then I am not sure at the moment how to go about the merging step.
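A rough sketch of the job-array idea, reusing the header and the `metawibele preprocess` flags shown above. Everything else is an assumption for illustration: the `samples.txt` list, the helper functions `pick_sample` and `stage_sample`, the per-sample folder layout, the slot count, and the idea that every read file of a sample starts with its sample ID. It does not address the merging step.

```shell
#!/usr/bin/env bash
#$ -e ./logs/
#$ -o ./logs/
#$ -S /bin/bash
#$ -l mem_free=64G
#$ -pe smp 8
#$ -t 1-200
# Slot count, memory and task range above are placeholders.

# Hypothetical list file: one sample ID per line, in task order.
SAMPLE_LIST="${SAMPLE_LIST:-samples.txt}"

# Print the sample ID found on line N of a list file.
pick_sample() {
    sed -n "${2}p" "$1"
}

# Symlink one sample's read files into its own input folder.
# Assumes no sample ID is a prefix of another sample ID.
stage_sample() {
    local sample="$1" src dest="$3" f
    src="$(cd "$2" && pwd)"          # absolute path so the links stay valid
    mkdir -p "$dest"
    for f in "$src/$sample"*.fastq.gz; do
        if [ -e "$f" ]; then
            ln -sf "$f" "$dest/"
        fi
    done
}

# Runs only inside an array task; SGE sets SGE_TASK_ID and NSLOTS.
if [ -n "${SGE_TASK_ID:-}" ] && [ -f "$SAMPLE_LIST" ]; then
    sample="$(pick_sample "$SAMPLE_LIST" "$SGE_TASK_ID")"
    stage_sample "$sample" "$INPUT" "$OUTPUT/$sample/input"
    metawibele preprocess \
        --input "$OUTPUT/$sample/input" \
        --output "$OUTPUT/$sample" \
        --output-basename "$sample" \
        --extension-paired "$EXTENSION_PAIRED" \
        --extension ".fastq.gz" \
        --config "$CONFIG" \
        --threads "${NSLOTS:-8}"
fi
```

Each task then works on a single sample, which is the case that already runs successfully. Whether the per-sample outputs can be combined for the next two steps remains the open question.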
I am using the following version:
$ metawibele preprocess --version
preprocess.py v0.4.5