Dividing an abundance table for multiple Maaslin runs

Hello!

I am trying to run maaslin on the gene families output of humann4, the table is very large and using it as an input into maaslin causes memory issues. Can one break up the table and run maaslin on each of the subsets?

Thanks for your help

1 Like

Hi there @Milad4849

You could break up your run into multiple MaAsLin3 runs, however, if you do it this way you should normalize and filter your data outside of MaAsLin3 and then turn those options off in each individual MaAsLin3 call. Further you will also need to combine all of the all_results.tsv tables and recalculate corrected q values from the original p values using FDR adjustment.

Cheers,
Jacob Nearing

Also, are you trying to use multiple cores? If you just use 1 core, it usually isn’t that much slower but it saves a ton of memory.

1 Like

Hi and @WillNickols ,

@nearinj That worked well for me, thanks! @WillNickols, we noticed that and went down to 1 come.

My joined gene families file is around 1.5 Gb and 250Gb of RAM was not enough. We also have a 300Gb joined table…I am guessing this can be a broader issue. Do you think that handling of large input tables, as @nearinj wrote, is a functionality worth adding to MaAsLin3?

Hi, is that to say that even at 1 core 250 Gb RAM was not enough for the 1.5 Gb table? In my experience, the memory usage shouldn’t be that high, even with a big dataset. Can you send the maaslin3 command you’re running?

A 300 Gb table will probably be too much, at least without batches. What’s in that table?

Will

I double checked, I used 2 cores in that case. Here is my maaslin3 call:

maaslin3(
input_data = adata,
input_metadata = metadata,
output = output_dir,
formula = ‘~ Conc + Day + Conc:Day + total_reads + (1 | Animal) + (1 | cage)’,
min_abundance = 0.001, 
min_prevalence = 0.1,
normalization = “NONE”,
transform = “LOG”,
)

One a side note; I could not find a way to specify both of the random variables as small

The 300Gb table is the joined gene families table of all the samples in a large dataset with multiple timepoints and treatment arms.

I guess we are running into issues since we run maaslin directly on the gene families table, which is not that informative and probably not what people typically do. At the pathway level, the tables are never that large.

Some people run MaAsLin directly on the gene families, but indeed that’s not the norm. Question on your formula: is each individual animal always in the same cage (e.g. animals 1, 2, 3 are in cage 1 and 4, 5, 6 are in cage 2, but you never have animal 1 in cage 1 sometimes and cage 2 another time)? If so, you don’t actually need the (1 | cage) since it’s collinear and therefore redundant with (1 | Animal) - this might also be contributing to the fitting difficulties.

Yes, the animal-cage association never changes and the random variables are collinear. Indeed removing (1 | cage) speeds up the computation quite a bit. Thanks @WillNickols!

I have a follow-up question; For both Cage and Animal groupings, we have 1-4 observations per group. i set the random variable to small, does the fact that we have cases of one observation per group cause issues here? Does it jus tnot estimate an intercept for those?

If you have 1 observation in some groups, the intercept for that group will be set to perfectly explain the observation. You’ll still get an intercept, but that point just won’t contribute any information to your overall effect.