Maaslin3 stuck at filtering

Hi

I was able to run maaslin2 on the data file below without any issues.

data file: genefamilies.tsv (23 samples, 1.7 million features)

However, maaslin3 (v0.99.7) gets stuck while filtering features based on abundance and prevalence. Maaslin3 successfully analyzed a smaller data file (pathabundance.tsv, 23 samples, 475 features).

The Maaslin3 script is below (R 4.4.2):

fit_out <- maaslin3(input_data = df_genefamilies,
                    input_metadata = df_metadata,
                    output = 'Maaslin3-genefamilies',
                    formula = '~ Phenotype',
                    reference = c("Phenotype,Resistant"),
                    min_abundance = 0.1,
                    min_prevalence = 0.1,
                    normalization = 'NONE',
                    transform = 'LOG',
                    augment = TRUE,
                    standardize = FALSE,
                    max_significance = 0.1,
                    median_comparison_abundance = TRUE,
                    median_comparison_prevalence = FALSE,
                    max_pngs = 250,
                    cores = 1)

I would appreciate it if you could help me resolve this issue.
Thanks
Gurjit

Hi Gurjit,

When you say it’s stuck, do you somehow know that it’s completely frozen, or is it just taking a long time to subset that data? How long has it taken? 1.7 million features is enormous, and if there’s any upstream subsetting you could do, that might help speed things up. (MaAsLin 3 should be fine for 100s to 10ks of features but maybe not for millions depending on your computing system.)

Will

Hi @WillNickols
Thanks for the prompt response.
Last log entry is INFO: Min samples required with min abundance for a feature not to be filtered: 0.000000.
This was about 15 hours ago.
This is unlike maaslin2 with the same data file on the same computing system (5 CPUs, 35GB). I started maaslin2 3 hours ago; it has fitted models to 1756560 features and is currently "counting total values for each feature". maaslin2 has written 1756585 lines to its log file in 3 hours, whereas maaslin3 has written 13 lines in 15 hours.
Gurjit

If the minimum samples required to not be filtered is 0 (from your outputs), MaAsLin 3 is probably trying to filter and then handle all your features (whereas MaAsLin 2 has probably dropped a bunch and is therefore faster). Since you’re running with normalization='NONE', are you sure that min_abundance and min_prevalence apply to the original scale of your data? With gene families, if they’re not already in normalized form, it’s quite plausible that more than 10% of the samples contain more than 0.1 count of every family. Have you applied the same filters and normalization to MaAsLin 2? Also, in developing MaAsLin 3, we reordered the normalization, filtering, and transformation steps so that they made more sense, and that might explain the difference.
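To make this concrete, here is a rough sketch (not from the thread) of how you could check whether the filter does anything on the un-normalized scale before launching the full run. It approximates the MaAsLin-style filter (abundance above min_abundance in at least a min_prevalence fraction of samples) and assumes df_genefamilies has samples as rows and features as columns; transpose first if yours is features-by-samples.

```r
# Approximate count of features that would survive filtering at
# min_abundance = 0.1 and min_prevalence = 0.1 (assumed orientation:
# samples in rows, features in columns).
abun <- 0.1
prev <- 0.1
n_samples <- nrow(df_genefamilies)
passes <- colSums(df_genefamilies > abun) >= prev * n_samples
cat(sum(passes), "of", ncol(df_genefamilies), "features pass the filter\n")
```

If nearly everything passes, the filter isn't reducing the problem size and the downstream model fitting will see all 1.7 million features.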

The input file has abundances as CPM (output of humann_renorm_table).

I tried to keep the maaslin3 parameters similar to maaslin2. Here is the maaslin2 script:

fit_data <- Maaslin2(
    input_data = df_gf_unstrat_v1,
    input_metadata = df_metadata_v2,
    min_prevalence = 0,
    normalization = "NONE",
    transform = "LOG",
    fixed_effects = c("Phenotype"),
    max_significance = 0.1,
    max_pngs = 100,
    reference = c("Phenotype,Resistant"),
    output = "humann/Maaslin2-gf_unstrat_v1_cpm")

I will let maaslin3 run. Maybe it needs longer to finish.

Thanks for your inputs.

With 1.7 million features and CPMs, you probably should have a lot of features that drop, so the fact that nothing drops in MaAsLin 3 is a bit surprising. If you can, I’d subset the features and check that the filtering is working as you expected with fewer features. Then, if you try some different subset sizes, you should be able to estimate how long this step should take and use this to evaluate whether it’s just slow or something’s actually broken. I’ll also warn that the filtering step is normally very fast compared to the rest of the run, and fitting 1.7 million linear models might take a very very long time if no features are filtered (though you do have a simple formula which should help).
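The subsetting-and-timing idea above could be sketched as follows (an illustration, not from the thread): run maaslin3 on random feature subsets of increasing size and time each run, using the same arguments as the script earlier in the thread. It assumes features are columns of df_genefamilies; adjust the indexing if yours is features-by-samples.

```r
# Time maaslin3 on random feature subsets to estimate how the
# filtering/fitting steps scale with the number of features.
set.seed(1)
for (n_feat in c(1000, 5000, 25000)) {
  sub <- df_genefamilies[, sample(ncol(df_genefamilies), n_feat)]
  t <- system.time(
    maaslin3(input_data = sub,
             input_metadata = df_metadata,
             output = paste0('Maaslin3-subset-', n_feat),
             formula = '~ Phenotype',
             reference = c("Phenotype,Resistant"),
             min_abundance = 0.1,
             min_prevalence = 0.1,
             normalization = 'NONE',
             transform = 'LOG')
  )
  cat(n_feat, "features:", t[['elapsed']], "seconds\n")
}
```

Extrapolating from a few subset sizes should indicate whether the full run is merely slow or genuinely stuck.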

Hi @WillNickols

I am still struggling with Maaslin 3 on my data. Just to make sure that there is no issue with the data file, I analyzed the same file (abundance in CPM) with Maaslin 2 and Maaslin 3. Here is a summary of the outcomes.

Maaslin 2 successfully completed the job in 1d 6h with 5 CPUs and 35GB MEM. Abridged and edited log file maaslin2_61038396-b.txt attached.
maaslin2_61038396-b.txt (4.2 KB)

I ran the same file through Maaslin 3 with 50 CPUs and 500 GB MEM. The job timed out after 96 hours. See the attached abridged and edited log file maaslin3c_61829627-b.txt. Maaslin 3 took 3d 15h to filter the features (Maaslin 2 took a few seconds at this step).
maaslin3c_61829627-b.txt (2.6 KB)

I would appreciate it if you could suggest a fix for this problem.

Thanks

Gurjit

Hi Gurjit,

Could you try choosing a min_abundance and min_prevalence (maybe 0.01 for both) and seeing if this cuts down the size of your data? With 1.7 million features and only 23 samples, you’re very unlikely to get any significant results after FDR correction, even if the upstream steps all work. If you can filter these in advance, that might speed things up as well.
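Filtering in advance could look like this (a sketch, not from the thread): apply the abundance/prevalence cutoffs yourself in base R and pass the reduced table to maaslin3, disabling its internal filter is then optional. Assumes samples are rows and features are columns of df_genefamilies.

```r
# Pre-filter the gene-family table before calling maaslin3: keep features
# with abundance > 0.01 in at least 10% of samples (assumed orientation:
# samples in rows, features in columns).
keep <- colSums(df_genefamilies > 0.01) >= 0.1 * nrow(df_genefamilies)
df_filtered <- df_genefamilies[, keep, drop = FALSE]
dim(df_filtered)  # check how much the table shrank before the full run
```

If this drops the table to a manageable size, the subsequent maaslin3 run has far fewer models to fit.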

Will

Hi @WillNickols

I started a job with min_abundance = 0.01 and min_prevalence = 0.1. It has been stuck at the filtering step for 1h 45min. I will update if it makes any progress. This time I am running with 5 CPUs and 70 GB MEM. The previous job with 70GB MEM got killed because of insufficient memory.

I have a 96hr time limit if I use more than 70GB. Is it possible to run maaslin3 in two stages? In the first stage, get filtered_data.tsv (it took about 3d 15h to get the filtered data). Then, use filtered_data.tsv as input and start modelling (skipping the filtering step).

Thanks

Gurjit

You can run it step by step as described here (search for the line “equivalent results would be produced from running one step at a time”). This should let you store the results of the filtering step so you can use them without re-filtering.
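If the step-by-step workflow isn't convenient, one workaround in the spirit of the two-stage idea above (a sketch; the file name filtered_data.tsv comes from the thread, but its location inside the output directory is an assumption to check) is to read the already-filtered table back in and re-run with the filters effectively disabled, so the filtering step becomes a no-op:

```r
# Stage 2 sketch: reuse the filtered table from the earlier run (path is
# assumed -- check your output directory) and disable filtering so the
# 3d 15h filtering step is not repeated.
df_filtered <- read.delim('Maaslin3-genefamilies/filtered_data.tsv',
                          row.names = 1, check.names = FALSE)
fit_out <- maaslin3(input_data = df_filtered,
                    input_metadata = df_metadata,
                    output = 'Maaslin3-genefamilies-stage2',
                    formula = '~ Phenotype',
                    reference = c("Phenotype,Resistant"),
                    min_abundance = 0,
                    min_prevalence = 0,
                    normalization = 'NONE',
                    transform = 'LOG')
```

With min_abundance = 0 and min_prevalence = 0 (parameters from the scripts above), no additional features should be removed on the second pass.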