Dividing an abundance table for multiple Maaslin runs

Milad4849 · August 20, 2025, 10:19am

Hello!

I am trying to run maaslin on the gene families output of humann4, the table is very large and using it as an input into maaslin causes memory issues. Can one break up the table and run maaslin on each of the subsets?

Thanks for your help

nearinj · August 20, 2025, 7:04pm

Hi there @Milad4849

You could break up your run into multiple MaAsLin3 runs, however, if you do it this way you should normalize and filter your data outside of MaAsLin3 and then turn those options off in each individual MaAsLin3 call. Further you will also need to combine all of the all_results.tsv tables and recalculate corrected q values from the original p values using FDR adjustment.

Cheers,
Jacob Nearing

WillNickols · August 22, 2025, 3:57am

Also, are you trying to use multiple cores? If you just use 1 core, it usually isn’t that much slower but it saves a ton of memory.

Milad4849 · August 25, 2025, 6:34pm

Hi and @WillNickols ,

@nearinj That worked well for me, thanks! @WillNickols, we noticed that and went down to 1 come.

My joined gene families file is around 1.5 Gb and 250Gb of RAM was not enough. We also have a 300Gb joined table…I am guessing this can be a broader issue. Do you think that handling of large input tables, as @nearinj wrote, is a functionality worth adding to MaAsLin3?

WillNickols · August 25, 2025, 7:38pm

Hi, is that to say that even at 1 core 250 Gb RAM was not enough for the 1.5 Gb table? In my experience, the memory usage shouldn’t be that high, even with a big dataset. Can you send the maaslin3 command you’re running?

A 300 Gb table will probably be too much, at least without batches. What’s in that table?

Will

Milad4849 · August 27, 2025, 11:12pm

I double checked, I used 2 cores in that case. Here is my maaslin3 call:

maaslin3(
input_data = adata,
input_metadata = metadata,
output = output_dir,
formula = ‘~ Conc + Day + Conc:Day + total_reads + (1 | Animal) + (1 | cage)’,
min_abundance = 0.001, 
min_prevalence = 0.1,
normalization = “NONE”,
transform = “LOG”,
)

One a side note; I could not find a way to specify both of the random variables as small

The 300Gb table is the joined gene families table of all the samples in a large dataset with multiple timepoints and treatment arms.

I guess we are running into issues since we run maaslin directly on the gene families table, which is not that informative and probably not what people typically do. At the pathway level, the tables are never that large.

WillNickols · August 27, 2025, 11:41pm

Some people run MaAsLin directly on the gene families, but indeed that’s not the norm. Question on your formula: is each individual animal always in the same cage (e.g. animals 1, 2, 3 are in cage 1 and 4, 5, 6 are in cage 2, but you never have animal 1 in cage 1 sometimes and cage 2 another time)? If so, you don’t actually need the (1 | cage) since it’s collinear and therefore redundant with (1 | Animal) - this might also be contributing to the fitting difficulties.

Milad4849 · September 1, 2025, 10:50am

Yes, the animal-cage association never changes and the random variables are collinear. Indeed removing (1 | cage) speeds up the computation quite a bit. Thanks @WillNickols!

I have a follow-up question; For both Cage and Animal groupings, we have 1-4 observations per group. i set the random variable to small, does the fact that we have cases of one observation per group cause issues here? Does it jus tnot estimate an intercept for those?

WillNickols · September 1, 2025, 3:16pm

If you have 1 observation in some groups, the intercept for that group will be set to perfectly explain the observation. You’ll still get an intercept, but that point just won’t contribute any information to your overall effect.

Topic		Replies	Views
Maaslin3 Memory Consumption during Logistic/Prevalence Abundance MaAsLin	6	72	February 21, 2025
Maaslin3 stuck at filtering MaAsLin	9	126	March 24, 2025
Computer freezes when running a particular file in MaasLin2 MaAsLin	3	342	February 14, 2023
Masslin2 runs forever for 530 K associations MaAsLin	6	658	January 14, 2021
MaAsLin3 with estimated microbial load MaAsLin	8	114	December 24, 2024

Dividing an abundance table for multiple Maaslin runs

Related topics