MTXmodel / Maaslin2 - Max feature number?

Dear bioBakery devs,

I am a novice in metagenomic / metatranscriptomics analysis and slowly progressing on the steep learning curve.

I came across your article “Statistical approaches for differential expression analysis in metatranscriptomics” from @YancongZhang and since I have both MGX and MTX data, the methods you propose for normalizing the MTX data with paired MGX data is something I’d like to apply.

The experiment is about comparing the effect of phytoremediation on a contaminant. We sampled from both the planted and unplanted mesocosms over time which generated longitudinal data with 4 time points. At each time point I have 8 samples as in this post.

I have transformed both the DNA and RNA datasets with a TSS transformation. I then proceeded to subset the genes that are present in each of the 2 condition and in 3 out of 4 samples in either condition, to have more “confidence” in the differential abundance results. This still leaves me with 7 850 344 genes. I’ve seen in this post, that @franzosa advised to replace the removed genes with a single per-sample “other” feature that absorbs the removed mass; which I haven’t done for now…

I ran the following model :

> fit_data <- MTXmodel(D42_0.75_ubi_rna_t, 
>                      D42_sed_sdata, 
>                      output="MTXmodel_test", 
>                      fixed_effects = c("Sample_type"),
>                      random_effects = NULL,
>                      reference = "Sample_type,No_plant",
>                      min_abundance = 0.0,
>                      min_prevalence = 0.0,
>                      normalization = 'NONE',
>                      transform = "LOG",
>                      correction="BH",
>                      max_significance=0.05,
>                      standardize = FALSE,
>                      input_dnadata = D42_0.75_ubi_gene_t,
>                      rna_dna_flt = "lenient" )

and here is part of the output:

> 2024-05-02 14:50:55.54114 INFO::Total samples in data: 8
> 2024-05-02 14:50:55.542702 INFO::Min samples required with min abundance for a feature not to be filtered: 0.000000
> 2024-05-02 14:56:06.15202 INFO::Total filtered features: 4024563

From what i understand, the model filtered out 4024563 features. Still the model timed out after 5hours of computing an a 120G ram cluster.

My questions are :

  1. Would you reduce even more the number of features ?

  2. If so would you rather use the min_abundance and min_prevalence parameters from the model. Would that solve the problem created by the manual subsetting which causes the libraries to differ in size ?

  3. Does the MTXmodel function automaticly detects which model number to apply, in my case it would be M4.

Many thanks,

Simon

Hi Simon,

Re1-2: Based on your current setting for MTXmodel, you required that all features for modeling had abundance>0 in at least one sample, otherwise they would be filtered. Turning min_abundance and min_prevalence parameters could help filtering features. For example, if you set min_prevalence=0.25, then the model will select features detected with >0 abundance in at least two samples in your case. However, this strategy filters features based on their detection on the whole dataset, slightly different from what you manually did within stratified conditions.

Re3: MTXmodel is implemented for linear models with DNA covariates (i.e. M5 and M6 in our papers).

Thanks!
Yancong

1 Like

Hello Yancong,

Thank you for your quick reply !

Re1-2: Based on your current setting for MTXmodel, you required that all features for modeling had abundance>0 in at least one sample, otherwise they would be filtered. Turning min_abundance and min_prevalence parameters could help filtering features. For example, if you set min_prevalence=0.25, then the model will select features detected with >0 abundance in at least two samples in your case. However, this strategy filters features based on their detection on the whole dataset, slightly different from what you manually did within stratified conditions.

I see, and on a more practical point of view, do you think it’s worthwhile to launch this analysis on millions of features when I only have 8 samples? A lot of examples I’ve seen so far have “just” hundreds or thousands features but hundreds of samples…

Re3: MTXmodel is implemented for linear models with DNA covariates (i.e. M5 and M6 in our papers).

Ok, I was originally set on trying to use the RNA/DNA ratio as an input for differential abundance analysis but I couldn’t really find any relevant package that was appropriate for this kind of data. Since your experiment found M6 more performant than M4, then it’s even better!

Simon

Hi Simon,

It’s very likely that you pick spurious features randomly from a dataset with very small sample-size. You may want to use more strict criteria to refine your features to avoid spurious associations.

Thanks!
Yancong

Hi Yancong,

I understand, I’ll see how to subset more stringently my dataset.
Thanks for your help !

Simon

Hi Yancong,

Somewhat related question, it seems I’m having trouble “activating” the cores with the MTXmodel function even though I am specifying them in the cores variable.
Do you have to something else ? I’ve tried using doParallel such as:

no_cores <- 32
cl <- makeCluster(no_cores)
registerDoParallel(cl)

prior to run the MTXmodel with cores = 32, but it did not seem to work.

Simon

Hi Simon,

I apologized for my late reply. Since MTXmodel creates a unique formula for each feature in the modeling, it’s a bit of a challenge to make parallel computing due to different tasks. With such implementation concerns, the current version of MTXmodel doesn’t activate the cores variable yet.

Best,
Yancong