I was able to run maaslin2 on the data file below without any issues.
data file: genefamilies.tsv (23 samples, 1.7 million features)
However, maaslin3 (v 0.99.7) gets stuck filtering features based on abundance and prevalence. Maaslin3 successfully analyzed a smaller data file (pathabundance.tsv, 23 samples, 475 features).
When you say it’s stuck, do you somehow know that it’s completely frozen, or is it just taking a long time to subset that data? How long has it taken? 1.7 million features is enormous, and if there’s any upstream subsetting you could do, that might help speed things up. (MaAsLin 3 should be fine for 100s to 10ks of features but maybe not for millions depending on your computing system.)
Hi @WillNickols
Thanks for the prompt response.
Last log entry is INFO: Min samples required with min abundance for a feature not to be filtered: 0:000000.
This was about 15 hours ago.
This is unlike maaslin2 with the same data file on the same computing system (5 CPUs, 35GB). I started maaslin2 3 hours ago, and it has fitted models to 1756560 features and is currently “counting total values for each feature”. maaslin2 has written 1756585 lines to its log file in 3 hours, whereas maaslin3 has written 13 lines in 15 hours.
Gurjit
If the minimum samples required to not be filtered is 0 (from your outputs), MaAsLin 3 is probably trying to filter and then handle all your features (whereas MaAsLin 2 has probably dropped a bunch and is therefore faster). Since you’re running with normalization='NONE', are you sure that min_abundance and min_prevalence apply to the original scale of your data? With gene families, if they’re not already in normalized form, it’s quite plausible that more than 10% of the samples contain more than 0.1 count of every family. Have you applied the same filters and normalization to MaAsLin 2? Also, in developing MaAsLin 3, we reordered the normalization, filtering, and transformation steps so that they made more sense, and that might explain the difference.
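To make the scale question concrete, here is a rough sketch of a prevalence/abundance filter. It assumes a feature is kept when the number of samples above min_abundance is at least min_prevalence times the sample count. That rule is taken from the description in this thread, not from the MaAsLin source, and the function names and toy values are made up for illustration.

```python
def passes_filter(values, min_abundance, min_prevalence):
    """Return True if enough samples exceed min_abundance.

    Assumed rule (not MaAsLin's actual code): keep the feature when
    the count of samples with abundance > min_abundance is at least
    min_prevalence * number_of_samples.
    """
    n_required = min_prevalence * len(values)
    n_above = sum(1 for v in values if v > min_abundance)
    return n_above >= n_required


def count_kept(table, min_abundance=0.1, min_prevalence=0.1):
    """Count the features in {name: [abundances]} that pass the filter."""
    return sum(
        passes_filter(values, min_abundance, min_prevalence)
        for values in table.values()
    )


# Hypothetical toy table: 3 features x 10 samples.
toy = {
    # Count-like scale: every sample is well above 0.1.
    "count_scale_family": [12, 40, 7, 3, 9, 15, 2, 30, 5, 8],
    # Proportion-like scale: no sample exceeds 0.1.
    "proportion_scale_family": [0.001, 0.0, 0.002, 0.0, 0.0,
                                0.003, 0.0, 0.0, 0.001, 0.0],
    # Never observed in any sample.
    "absent_family": [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
}

kept_default = count_kept(toy)                 # thresholds 0.1 and 10%
kept_no_threshold = count_kept(toy, 0.0, 0.0)  # zero samples required
print(kept_default, kept_no_threshold)
```

Under this assumed rule, a threshold of 0.1 only bites when the data are on a scale where 0.1 is meaningful, and a required sample count of 0 keeps every feature, including ones that are absent everywhere.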
With 1.7 million features and CPMs, you probably should have a lot of features that drop, so the fact that nothing drops in MaAsLin 3 is a bit surprising. If you can, I’d subset the features and check that the filtering is working as you expected with fewer features. Then, if you try some different subset sizes, you should be able to estimate how long this step should take and use this to evaluate whether it’s just slow or something’s actually broken. I’ll also warn that the filtering step is normally very fast compared to the rest of the run, and fitting 1.7 million linear models might take a very very long time if no features are filtered (though you do have a simple formula which should help).
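One way to try the subset-and-time idea is sketched below. It draws a random subset of feature rows from a TSV (keeping the header) and extrapolates run time from a few timed subsets. The helper names are hypothetical, the timings in the example are placeholders rather than measurements, and the extrapolation assumes time grows roughly linearly with the number of features, which may not hold.

```python
import random


def sample_features(lines, n_features, seed=0):
    """Reservoir-sample n_features data rows from TSV lines, keeping the header."""
    it = iter(lines)
    header = next(it)
    rng = random.Random(seed)
    reservoir = []
    for i, line in enumerate(it):
        if i < n_features:
            reservoir.append(line)
        else:
            # Replace an existing row with probability n_features / (i + 1).
            j = rng.randint(0, i)
            if j < n_features:
                reservoir[j] = line
    return [header] + reservoir


def extrapolate_seconds(timings, target_features):
    """Fit seconds = rate * n_features (least squares through the origin)
    and predict the time for target_features. Assumes linear scaling."""
    num = sum(n * t for n, t in timings)
    den = sum(n * n for n, _ in timings)
    return num / den * target_features


# In-memory demo; for a real file, pass an open file handle instead, e.g.
#   with open("genefamilies.tsv") as fh: subset = sample_features(fh, 10000)
demo_lines = ["feature\ts1\ts2\n"] + [f"f{i}\t{i}\t{i + 1}\n" for i in range(100)]
subset = sample_features(demo_lines, 10)

# Placeholder timings as (n_features, seconds); replace with your own measurements.
placeholder_timings = [(1_000, 2.0), (10_000, 20.0)]
estimate = extrapolate_seconds(placeholder_timings, 1_700_000)
print(len(subset) - 1, "features sampled; estimated seconds:", round(estimate))
```

If the measured times for a few subset sizes do not fall near a straight line, that is itself a hint that the step scales worse than linearly, so a linear estimate would be too optimistic.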