Biobakery metagenomics: disk storage requirements for running the Biobakery wmgx pipeline

Hello Biobakery admins and users,

This is my first time using the Biobakery pipeline for my metagenomics data. I have 160 fastq.gz files (paired-end) for 80 samples, each about 8 GB. I have 3 TB of disk space available in my lab's cluster directory to run the pipeline; the raw data itself takes up ~1.3 TB, leaving ~1.7 TB free. I ran the script I wrote for the Biobakery pipeline, but after running for 13 hours it failed because it ran out of disk space (when I checked, the available space was 0 TB). Is there a way to determine, ahead of time, how much disk space I'll need to run the pipeline, so that I can request an increase?

Thank you. :slight_smile:

Update: I've added the option `--remove-intermediate-output` to my script. I'd still like an answer to my question, but I figured adding this option to my command should circumvent the issue of running out of disk space.
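For anyone reading later, here is a minimal sketch of what the wmgx call looks like with that flag added. The paths and thread count below are placeholders, not my exact cluster setup:

```bash
# Sketch of a biobakery_workflows wmgx invocation with intermediate-output
# removal enabled. The input/output paths and thread count are placeholders.
biobakery_workflows wmgx \
    --input /path/to/raw_fastq \
    --output /path/to/wmgx_output \
    --threads 8 \
    --remove-intermediate-output
```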

Update: despite removing intermediate files, I still ran out of space. Before I request more storage from our cluster, can someone provide an answer to my question about estimating disk requirements up front?

Hi @Mak0130, yes, that is a great option to add; it will greatly reduce the overall disk space needed. In general, while running the workflows, I would estimate you need about 3x the size of your input in disk space for intermediate files. So if you have ~1.2 TB of input files, I would estimate you would need about 3.8 TB of disk space overall if you were to run all of your input files at once. Since you have 3 TB of disk space in total, I would suggest running your input files in batches, splitting your runs into 3 or 4 sets. If you are using the biobakery workflows, you can do this by adding the option `--local-jobs N` or `--grid-jobs N`, so that only N jobs run at once. Please post if you run into any additional issues!
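If it helps, here is a back-of-the-envelope version of that estimate as a quick shell check before launching. The input path is a placeholder, and the 3x multiplier is only the rule of thumb above, not an exact figure:

```bash
# Rough disk-space check based on the ~3x rule of thumb above.
# INPUT_DIR is a placeholder for the directory holding the fastq.gz files.
INPUT_DIR=/path/to/raw_fastq

# Total size of the input files, in GB (GNU du).
input_gb=$(du -s --block-size=1G "$INPUT_DIR" | cut -f1)

# Estimated overall need: roughly 3x the input size (input plus
# intermediate and output files).
echo "Input size:          ${input_gb} GB"
echo "Estimated total need: ~$(( input_gb * 3 )) GB"
```

If that estimate does not fit your quota, that is a good sign you will want to split the input into batches (for example, by pointing `--input` at a subdirectory holding a subset of the files for each run) rather than launching everything at once.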

Thanks!
Lauren

Thanks for the reply! I am running all my files at once; the KneadData step of the pipeline finished this morning, and it used 4.2 TB of disk space. Luckily I had requested an increase in my storage quota to about 7 TB, and I am currently using 5.6 TB of it. Hopefully the remaining 1.4 TB is enough for the rest of the pipeline. Will update when it has finished running. If I run out of disk space again, I'll split the analysis into batches. Thanks!