Maaslin3 Memory Consumption during Logistic/Prevalence Testing

Hello!

I really enjoy working with Maaslin3! I'm currently running it as an R package on around 1000 samples that are longitudinal in nature (i.e., they have a mixed-effects component). I am finding that during the logistic/prevalence testing phase I run out of memory, and the session never recovers from it.

I am running on an HPC cluster and assigned 100 GB of memory to my Maaslin3 job, yet it was still killed for exceeding that limit. If I use a smaller dataset that stays just under my assigned memory limit, I still cannot free the memory afterward; I have to fully restart R before I can do anything memory-intensive again.

Do you happen to know what might be causing this?

Thank you so much!

Hi!

Can you send the maaslin3 command you’re running? Have you turned on save_models, increased the max_pngs substantially, or turned on parallelization?

Will

Hi Will!

Here is my current command:

library(maaslin3)   # maaslin3()
library(phyloseq)   # otu_table(), sample_data()

fit_out <- maaslin3(input_data = data.frame(otu_table(phy)),
                    input_metadata = data.frame(sample_data(phy)),
                    formula = "~ myvar",
                    output = "maaslin3",
                    normalization = "NONE",
                    transform = "LOG",
                    standardize = FALSE,
                    max_significance = 0.05,
                    min_abundance = 100,
                    min_prevalence = 0.1,
                    max_pngs = 1,
                    warn_prevalence = FALSE,
                    cores = future::availableCores())

My input_data has about 6000 features and 900 samples; with the filters here, about 5000 features are filtered out. My input_metadata has ~1000 variables, most of which I don't need and could cut down. I definitely wanted to make sure I wasn't saving models or generating PNGs. The model should also be mixed-effects, but I thought the mixed-effects models might be causing issues (in the pre-Maaslin days, I had a lot of related trouble with glmmTMB), so I pulled that part out too…

Update: when I limit my sample metadata to only the variables I want (i.e., data.frame(sample_data(phy)[, c("myvar", "myvar2")])), total memory usage is much better. I only use about 3 GB during the logistic phase instead of blowing past my 100 GB allocation, and the code I shared above works.
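For reference, here is roughly what the subsetting looks like ("myvar" and "myvar2" are just placeholders for my real variable names):

library(phyloseq)

# phy is my phyloseq object; keep only the metadata columns the model uses
meta_small <- data.frame(sample_data(phy)[, c("myvar", "myvar2")])

# meta_small then goes into maaslin3() as input_metadata in place of
# data.frame(sample_data(phy))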

I still have the 3 GB of memory in use after Maaslin3 completes successfully, and another run of Maaslin3 uses an additional 3 GB on top of that, so I am limited in the total number of Maaslin3 runs I can do in a single R session without restarting R…
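In the meantime, a workaround I'm considering (not Maaslin3-specific, and just a sketch I haven't fully vetted) is to run each fit in a throwaway R subprocess via the callr package, so whatever memory the fit holds is returned to the OS when that process exits:

library(callr)
library(phyloseq)

# Pull plain data frames out of the phyloseq object in the parent session
abun <- data.frame(otu_table(phy))
meta <- data.frame(sample_data(phy)[, c("myvar", "myvar2")])

# Run the fit in a fresh R process; memory it allocates is freed when the
# process exits, so repeated runs shouldn't accumulate in this session
fit_out <- callr::r(function(abun, meta) {
    maaslin3::maaslin3(input_data = abun,
                       input_metadata = meta,
                       formula = "~ myvar",
                       output = "maaslin3",
                       normalization = "NONE",
                       transform = "LOG",
                       standardize = FALSE,
                       max_significance = 0.05,
                       min_abundance = 100,
                       min_prevalence = 0.1,
                       max_pngs = 1,
                       warn_prevalence = FALSE,
                       cores = future::availableCores())
}, args = list(abun = abun, meta = meta))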

If you run with 1 core rather than future::availableCores(), do you see the same memory leakage and peak-memory issues? My guess is that the pbapply package is copying your entire working memory for each parallel worker. In my own runs, I've found that using multiple cores often doesn't speed things up much and seriously inflates the memory used. I've never found parallelization in R to be great, and it's possible there's a bug in the pbapply package.
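For example, keeping all of your other arguments the same (including the reduced metadata from your update) and only changing the cores argument:

fit_out <- maaslin3(input_data = data.frame(otu_table(phy)),
                    input_metadata = data.frame(sample_data(phy)[, c("myvar", "myvar2")]),
                    formula = "~ myvar",
                    output = "maaslin3",
                    normalization = "NONE",
                    transform = "LOG",
                    standardize = FALSE,
                    max_significance = 0.05,
                    min_abundance = 100,
                    min_prevalence = 0.1,
                    max_pngs = 1,
                    warn_prevalence = FALSE,
                    cores = 1)  # single core, to test whether the parallel workers are inflating memory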

That seems to have about halved the total memory, so I'm down to ~1.5 GB per Maaslin3 run, which is a definite improvement over 100 GB!

Sounds good - I’ll consider this resolved unless you think the runtime is now going to be too long.