Taxon-covariate in MTXmodel

I’d like to test out taxon-(RNA) covariate analysis as described in Statistical approaches for differential expression analysis in metatranscriptomics | Bioinformatics | Oxford Academic
The publication directs me to GitHub - biobakery/MTX_model: R package for differential expression analysis in metatranscriptomics for the “analysis code”. However, the repository / tool seems to be aimed only at feature-DNA covariate analysis.

Can I use the tool for taxon-RNA covariate analysis by providing as --input_dnadata a file where each feature/sample is assigned the TPM sum of the related taxon in the sample? If not, is there another source for the code that was used to produce the results for models 2 and 5?

Best regards,
Tom

Hi Tom,

You can use MTXmodel to calculate DE using taxon-RNA covariate as well. In this case, your RNA features and their corresponding source taxon should be matched. For example, you could format your input files as:

RNA feature file:

ID sample1 sample2 …
Gene1__Taxon1 value value
Gene2__Taxon1 value value
Gene3__Taxon2 value value
…

Taxon-RNA covariate file:

ID sample1 sample2 …
Taxon1 value value
Taxon2 value value
…

Thanks!
Yancong

1 Like

Hi Yancong,

Thank you for the quick response! How would I feed this Taxon-RNA covariate file into the MTXmodel tool? The only argument that is documented for covariate input is --input_dnadata. And that one seems to require data on the level of individual gene IDs.

Regards,
Tom

Hi Tom,

The taxon-RNA covariate is feeded to --input_dnadata. In this case, taxon-RNA acts as a proxy for taxon abundance when MGX data is unavailable, with taxon abundance offering an estimate of the gene copy number per feature, assuming consistent encoding.

Thanks!
Yancong

1 Like

Hi Yancong,

once again, thank you for getting back to me so fast. I have preprocessed my dataset to create the inputs as you described. For testing, I am using only a subset of the data with a single metadata variable (time).

My count data file looks like this (prefixes being taxa names is just an artifact of the prior analysis pipeline)

ID control_0_A control_0_B control_0_C Pea_30_A Pea_30_B Pea_30_C Pea_480_A
L_plantarum_1_2__L_plantarum 1341.0 1095.0 1024.0 15.0 25.0 24.0 279.0
L_plantarum_1_3__L_plantarum 441.0 379.0 363.0 8.0 6.0 6.0 80.0
L_plantarum_1_4__L_plantarum 204.0 172.0 142.0 5.0 4.0 5.0 35.0

The taxon-RNA covariate data fed as --input_dnadata is formatted like this:

ID control_0_A control_0_B control_0_C Pea_30_A Pea_30_B Pea_30_C Pea_480_A
L_plantarum 3111281.0 2526283.0 2359353.0 107508.0 169824.0 157053.0 482013.0
Le_mesenteroides 502052.0 378775.0 352994.0 48116.0 104420.0 106500.0 327485.0

metadata looks like this:

ID control_0_A control_0_B control_0_C Pea_30_A Pea_30_B Pea_30_C Pea_480_A
time 0 0 0 30 30 30 480

I run the tool as follows:

MTXmodel.R \
        {counts_path} \
        {metadata_path} \
        {output_dir} \
        --min_abundance 0 \
        --min_prevalence 0.0 \
        --max_significance 0.25 \
        --min_variance 0.0 \
        --correction BH \
        --standardize FALSE \
        --normalization TSS \
        --transform LOG \
        --analysis_method LM \
        --cores 1 \
        --fixed_effects time \
        --reference time,30 \
        --rna_dna_flt lenient\
        --input_dnadata {covariate_path} 

Inspecting the log, it looks like detection of the input format is going wrong. metadata samples is columns, then rows.

2024-06-06 07:49:43.173591 INFO::Determining format of input files
2024-06-06 07:49:43.175754 INFO::Input format is data samples as columns and metadata samples as columns
2024-06-06 07:49:43.204545 INFO::Input format is dnadata samples as columns and metadata samples as rows
2024-06-06 07:49:43.264767 INFO::Formula for fixed effects: expr ~  time

The output doesn’t make sense to me either, since every value is time instead of the actual timepoints (30, 480 etc.).

feature metadata value coef stderr N N.not.0 pval qval
L_plantarum_9_115__L_plantarum time time -0.0010509 4.17e-05 12 12 1.18e-09 3.15e-06
Le_mesenteroides_22_6__Le_mesenteroides time time -0.0011778 4.91e-05 12 12 1.82e-09 3.15e-06

I would be very thankful for help making this work.

Best regards,
Tom

Hi Tom,

For your first question, MTXmodel formats all the input files (i.e. feature data, taxon-RNA covariate data, metadata sample) into tables with sample ID as rows. In your case, MTXmodel first checked your feature file and metadata file, and converted metadata samples from columns to rows. Thus, in the following checking for covariate file and the formatted metadata, metadata samples were detected as rows.

For your second question, MTXmodel reports the outputs of continuous variables slightly different from category variables. Your metadata variable (i.e. time) was treated as a continuous variable. In this case, MTXmodel reported the statistical result from linear regression (i.e. regressing time-point values against feature values) in one row per feature, where the value column represents what continuous variable it is rather than each time point value.

Thanks!
Yancong

1 Like

Hi Yancong,

thank you again for all your help. It took me some time to get back to this project. I think I have the tool working as it should now, but would be very thankful for some advise on the usage:

For the time series I’d expect expression of individual genes to go up and down over the course of the experiment. Thus, I want to avoid modeling a linear regression spanning >2 time points. I have split up the dataset and fed in the data in pairs (e.g. 3 samples from t0 and 3 samples from t30).

Can you give me a recommendation for the parameters in this setup? Will CSS + a negative binomial model combined with taxon covariates produce sensible results?

Best regards,
Tom

Hi Tom,

If you’d like to examine differential expression between time points, categorizing your time series and treating this categorized variable as a fixed effect should work (e.g. time_category: t0, t30).

According to our previous evaluation using synthetic data (PMID: 34784344), we observed inconsistent behavior and many false positive findings for existing negative binomials; LM-based approaches showed a good statistical power while controlling false discoveries. If you do want to see differences between models, you could try different models and see which fit best to your expectations.

Thanks!
Yancong

1 Like

Hi Yancong,

I ran the different models before my last post and was surprised at how much the results differed. Thank you for the answer and the reference. I’ll just stick with TSS + LM + Taxon Covariate.

I have only used what you call ‘taxon-specific scaling’ plus DESeq in the past and originally wanted to try switching to the Feature-DNA covariate approach you suggest. But alas, I don’t have DNA count data for this current dataset. Hence the M3 approach.

Thank you again for your help & patience. As somebody who mainly does software development statistical analysis often feels like a minefield.

Cheers,
Tom