Hello!
I am trying to run maaslin on the gene families output of humann4, the table is very large and using it as an input into maaslin causes memory issues. Can one break up the table and run maaslin on each of the subsets?
Thanks for your help
Hi there @Milad4849
You could break up your run into multiple MaAsLin3 runs. However, if you do it this way, you should normalize and filter your data outside of MaAsLin3 and then turn those options off in each individual MaAsLin3 call. You will also need to combine all of the all_results.tsv tables and recalculate the corrected q-values from the original p-values using FDR adjustment.
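In case it helps, a rough sketch of the combine-and-readjust step. This is not an official MaAsLin3 utility, and the column names (pval_individual, qval_individual) are assumptions based on a typical all_results.tsv, so check them against your actual output. The toy files below just stand in for real batch outputs:

```r
# Sketch: concatenate per-batch all_results.tsv tables and recompute
# BH-corrected q-values from the raw p-values across ALL tests.
# Column names are assumptions; adjust to match your MaAsLin3 output.

# Toy stand-ins for two batch output files:
batch_dir <- tempfile()
dir.create(batch_dir)
for (i in 1:2) {
  d <- data.frame(feature = paste0("gene_b", i, "_", 1:3),
                  pval_individual = runif(3))
  write.table(d, file.path(batch_dir, paste0("batch", i, "_all_results.tsv")),
              sep = "\t", row.names = FALSE, quote = FALSE)
}

files <- list.files(batch_dir, pattern = "all_results\\.tsv$", full.names = TRUE)
combined <- do.call(rbind, lapply(files, read.delim))

# Per-batch q-values are invalid after splitting; recompute over all tests:
combined$qval_individual <- p.adjust(combined$pval_individual, method = "BH")

write.table(combined, file.path(batch_dir, "all_results_combined.tsv"),
            sep = "\t", row.names = FALSE, quote = FALSE)
```

The key point is that p.adjust runs once over the pooled p-values, not once per batch.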
Cheers,
Jacob Nearing
Also, are you trying to use multiple cores? If you just use 1 core, it usually isn't that much slower, but it saves a ton of memory.
Hi @nearinj and @WillNickols,
@nearinj That worked well for me, thanks! @WillNickols, we noticed that and went down to 1 core.
My joined gene families file is around 1.5 GB, and 250 GB of RAM was not enough. We also have a 300 GB joined table… I am guessing this can be a broader issue. Do you think that handling large input tables, as @nearinj described, is functionality worth adding to MaAsLin3?
Hi, is that to say that even at 1 core, 250 GB of RAM was not enough for the 1.5 GB table? In my experience, the memory usage shouldn't be that high, even with a big dataset. Can you send the maaslin3 command you're running?
A 300 GB table will probably be too much, at least without batches. What's in that table?
Will
I double checked; I used 2 cores in that case. Here is my maaslin3 call:
maaslin3(
    input_data = adata,
    input_metadata = metadata,
    output = output_dir,
    formula = "~ Conc + Day + Conc:Day + total_reads + (1 | Animal) + (1 | cage)",
    min_abundance = 0.001,
    min_prevalence = 0.1,
    normalization = "NONE",
    transform = "LOG"
)
On a side note: I could not find a way to specify both of the random effects as small.
The 300 GB table is the joined gene families table of all the samples in a large dataset with multiple timepoints and treatment arms.
I guess we are running into issues since we run maaslin directly on the gene families table, which is not that informative and probably not what people typically do. At the pathway level, the tables are never that large.
Some people run MaAsLin directly on the gene families, but indeed that's not the norm. Question on your formula: is each individual animal always in the same cage (e.g., animals 1, 2, 3 are in cage 1 and 4, 5, 6 are in cage 2, but you never have animal 1 in cage 1 sometimes and cage 2 another time)? If so, you don't actually need the (1 | cage) since it's collinear and therefore redundant with (1 | Animal) - this might also be contributing to the fitting difficulties.
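A quick way to check that nesting, as a sketch (the Animal/cage column names from the formula are assumed, and toy metadata stands in for the real table):

```r
# Toy metadata standing in for the real table (Animal/cage as in the formula):
metadata <- data.frame(Animal = c("a1", "a1", "a2", "a3", "a3"),
                       cage   = c("c1", "c1", "c1", "c2", "c2"))

# Count distinct cages per animal; if every animal is seen in exactly one
# cage, (1 | cage) carries no information beyond (1 | Animal).
cages_per_animal <- tapply(metadata$cage, metadata$Animal,
                           function(x) length(unique(x)))
nested <- all(cages_per_animal == 1)
```

If `nested` comes back TRUE on your real metadata, dropping (1 | cage) loses nothing.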
Yes, the animal-cage association never changes and the random variables are collinear. Indeed removing (1 | cage) speeds up the computation quite a bit. Thanks @WillNickols!
I have a follow-up question: for both Cage and Animal groupings, we have 1-4 observations per group. I set the random variable to small; does the fact that we have cases of one observation per group cause issues here? Does it just not estimate an intercept for those?
If you have 1 observation in some groups, the intercept for that group will be set to perfectly explain the observation. You'll still get an intercept, but that point just won't contribute any information to your overall effect.
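To illustrate with a fixed-effects analog (random-effect estimates are additionally shrunk toward the mean, so this is only a sketch of the idea): a group with a single observation gets a group effect that reproduces that point exactly, leaving a zero residual.

```r
# Group B has one observation; its fitted group mean equals that point,
# so the residual for the singleton is zero and it contributes nothing
# to estimating any shared effect.
d <- data.frame(y = c(1.0, 1.4, 5.0),
                grp = factor(c("A", "A", "B")))
fit <- lm(y ~ grp, data = d)
singleton_resid <- unname(resid(fit)[3])
```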