Maaslin3 stuck at filtering

Hi

I was able to run MaAsLin 2 on the data file below without any issues.

data file: genefamilies.tsv (23 samples, 1.7 million features)

However, MaAsLin 3 (v0.99.7) gets stuck while filtering features based on abundance and prevalence. MaAsLin 3 successfully analyzed a smaller data file (pathabundance.tsv, 23 samples, 475 features).

The MaAsLin 3 script is below (R 4.4.2):

fit_out <- maaslin3(input_data = df_genefamilies,
                    input_metadata = df_metadata,
                    output = 'Maaslin3-genefamilies',
                    formula = '~ Phenotype',
                    reference = c("Phenotype,Resistant"),
                    min_abundance = 0.1,
                    min_prevalence = 0.1,
                    normalization = 'NONE',
                    transform = 'LOG',
                    augment = TRUE,
                    standardize = FALSE,
                    max_significance = 0.1,
                    median_comparison_abundance = TRUE,
                    median_comparison_prevalence = FALSE,
                    max_pngs = 250,
                    cores = 1)

I would appreciate it if you could help me resolve this issue.
Thanks
Gurjit

Hi Gurjit,

When you say it’s stuck, do you somehow know that it’s completely frozen, or is it just taking a long time to subset that data? How long has it taken? 1.7 million features is enormous, and if there’s any upstream subsetting you could do, that might help speed things up. (MaAsLin 3 should be fine for 100s to 10ks of features but maybe not for millions depending on your computing system.)
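For example, if genefamilies.tsv still contains HUMAnN's stratified per-taxon rows (the ones with a '|' in the feature name), dropping them before modeling is a common way to shrink the table. A minimal sketch, assuming df_genefamilies has features as row names; skip this if your table is already unstratified:

# Keep only the unstratified (community-level) gene family rows; HUMAnN marks
# per-taxon stratified rows with a '|' in the feature name.
unstratified <- !grepl("|", rownames(df_genefamilies), fixed = TRUE)
df_genefamilies_unstrat <- df_genefamilies[unstratified, ]
cat(sprintf("Kept %d of %d features\n", sum(unstratified), length(unstratified)))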

Will

Hi @WillNickols
Thanks for the prompt response.
The last log entry is "INFO: Min samples required with min abundance for a feature not to be filtered: 0.000000".
That was about 15 hours ago.
This is unlike MaAsLin 2 with the same data file on the same computing system (5 CPUs, 35 GB). I started MaAsLin 2 3 hours ago; it has fitted models to 1756560 features and is currently "counting total values for each feature". MaAsLin 2 has written 1756585 lines to its log file in 3 hours, whereas MaAsLin 3 has written 13 lines in 15 hours.
Gurjit

If the minimum samples required to not be filtered is 0 (from your outputs), MaAsLin 3 is probably trying to filter and then handle all your features (whereas MaAsLin 2 has probably dropped a bunch and is therefore faster). Since you’re running with normalization='NONE', are you sure that min_abundance and min_prevalence apply to the original scale of your data? With gene families, if they’re not already in normalized form, it’s quite plausible that more than 10% of the samples contain more than 0.1 count of every family. Have you applied the same filters and normalization to MaAsLin 2? Also, in developing MaAsLin 3, we reordered the normalization, filtering, and transformation steps so that they made more sense, and that might explain the difference.
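As a quick sanity check, you could count how many features would survive those thresholds on the unnormalized scale. This is just a sketch of the filtering logic (not MaAsLin's internal code), using your min_abundance = 0.1 and min_prevalence = 0.1 and assuming df_genefamilies is numeric with features as rows and samples as columns; transpose first if your orientation is the other way around:

# A feature is kept if it exceeds min_abundance in at least
# min_prevalence * (number of samples) samples.
min_abundance  <- 0.1
min_prevalence <- 0.1

prevalence <- rowMeans(df_genefamilies > min_abundance)  # fraction of samples above threshold
n_kept <- sum(prevalence >= min_prevalence)
cat(sprintf("%d of %d features pass the filter\n", n_kept, nrow(df_genefamilies)))

If that comes back with essentially all 1.7 million features kept, the thresholds aren't doing anything on the CPM scale, which would match the "0 samples required" log line.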

The input file has abundances as CPM (output of HUMAnN's renorm script).

I tried to keep the MaAsLin 3 parameters similar to MaAsLin 2. Here is the MaAsLin 2 script:

fit_data <- Maaslin2(
  input_data = df_gf_unstrat_v1,
  input_metadata = df_metadata_v2,
  min_prevalence = 0,
  normalization = "NONE",
  transform = "LOG",
  fixed_effects = c("Phenotype"),
  max_significance = 0.1,
  max_pngs = 100,
  reference = c("Phenotype,Resistant"),
  output = "humann/Maaslin2-gf_unstrat_v1_cpm")

I will let MaAsLin 3 keep running; maybe it just needs longer to finish.

Thanks for your input.

With 1.7 million features and CPMs, a lot of features should probably drop, so the fact that nothing drops in MaAsLin 3 is a bit surprising. If you can, I'd subset the features and check that the filtering works as you expect with fewer features. Then, if you try a few different subset sizes, you should be able to estimate how long this step should take and use that to judge whether it's just slow or something's actually broken. I'll also warn that the filtering step is normally very fast compared to the rest of the run, and fitting 1.7 million linear models might take a very, very long time if no features are filtered (though you do have a simple formula, which should help).
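Something along these lines would let you time progressively larger random subsets and extrapolate. It's only a sketch that reuses your own parameters; it assumes features are rows in df_genefamilies, so adjust the subsetting if your samples are the rows instead:

# Time maaslin3 on random feature subsets of increasing size to estimate
# how the runtime scales before committing to the full 1.7M-feature run.
set.seed(1)
for (n_features in c(1000, 5000, 25000)) {
  keep <- sample(rownames(df_genefamilies), n_features)
  elapsed <- system.time(
    maaslin3(input_data = df_genefamilies[keep, ],
             input_metadata = df_metadata,
             output = paste0('Maaslin3-subset-', n_features),
             formula = '~ Phenotype',
             reference = c("Phenotype,Resistant"),
             min_abundance = 0.1,
             min_prevalence = 0.1,
             normalization = 'NONE',
             transform = 'LOG',
             max_pngs = 10,
             cores = 1)
  )["elapsed"]
  cat(sprintf("%d features: %.1f seconds\n", n_features, elapsed))
}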