Taxon-covariate in MTXmodel

I’d like to test out taxon-(RNA) covariate analysis as described in Statistical approaches for differential expression analysis in metatranscriptomics | Bioinformatics | Oxford Academic
The publication directs me to GitHub - biobakery/MTX_model: R package for differential expression analysis in metatranscriptomics for the “analysis code”. However, the repository / tool seems to be aimed only at feature-DNA covariate analysis.

Can I use the tool for taxon-RNA covariate analysis by providing as --input_dnadata a file where each feature/sample is assigned the TPM sum of the related taxon in the sample? If not, is there another source for the code that was used to produce the results for models 2 and 5?

Best regards,
Tom

Hi Tom,

You can use MTXmodel to calculate DE using taxon-RNA covariate as well. In this case, your RNA features and their corresponding source taxon should be matched. For example, you could format your input files as:

RNA feature file:

ID sample1 sample2 …
Gene1__Taxon1 value value
Gene2__Taxon1 value value
Gene3__Taxon2 value value
…

Taxon-RNA covariate file:

ID sample1 sample2 …
Taxon1 value value
Taxon2 value value
…

Thanks!
Yancong

1 Like

Hi Yancong,

Thank you for the quick response! How would I feed this Taxon-RNA covariate file into the MTXmodel tool? The only argument that is documented for covariate input is --input_dnadata. And that one seems to require data on the level of individual gene IDs.

Regards,
Tom

Hi Tom,

The taxon-RNA covariate is feeded to --input_dnadata. In this case, taxon-RNA acts as a proxy for taxon abundance when MGX data is unavailable, with taxon abundance offering an estimate of the gene copy number per feature, assuming consistent encoding.

Thanks!
Yancong

1 Like

Hi Yancong,

once again, thank you for getting back to me so fast. I have preprocessed my dataset to create the inputs as you described. For testing, I am using only a subset of the data with a single metadata variable (time).

My count data file looks like this (prefixes being taxa names is just an artifact of the prior analysis pipeline)

ID control_0_A control_0_B control_0_C Pea_30_A Pea_30_B Pea_30_C Pea_480_A
L_plantarum_1_2__L_plantarum 1341.0 1095.0 1024.0 15.0 25.0 24.0 279.0
L_plantarum_1_3__L_plantarum 441.0 379.0 363.0 8.0 6.0 6.0 80.0
L_plantarum_1_4__L_plantarum 204.0 172.0 142.0 5.0 4.0 5.0 35.0

The taxon-RNA covariate data fed as --input_dnadata is formatted like this:

ID control_0_A control_0_B control_0_C Pea_30_A Pea_30_B Pea_30_C Pea_480_A
L_plantarum 3111281.0 2526283.0 2359353.0 107508.0 169824.0 157053.0 482013.0
Le_mesenteroides 502052.0 378775.0 352994.0 48116.0 104420.0 106500.0 327485.0

metadata looks like this:

ID control_0_A control_0_B control_0_C Pea_30_A Pea_30_B Pea_30_C Pea_480_A
time 0 0 0 30 30 30 480

I run the tool as follows:

MTXmodel.R \
        {counts_path} \
        {metadata_path} \
        {output_dir} \
        --min_abundance 0 \
        --min_prevalence 0.0 \
        --max_significance 0.25 \
        --min_variance 0.0 \
        --correction BH \
        --standardize FALSE \
        --normalization TSS \
        --transform LOG \
        --analysis_method LM \
        --cores 1 \
        --fixed_effects time \
        --reference time,30 \
        --rna_dna_flt lenient\
        --input_dnadata {covariate_path} 

Inspecting the log, it looks like detection of the input format is going wrong. metadata samples is columns, then rows.

2024-06-06 07:49:43.173591 INFO::Determining format of input files
2024-06-06 07:49:43.175754 INFO::Input format is data samples as columns and metadata samples as columns
2024-06-06 07:49:43.204545 INFO::Input format is dnadata samples as columns and metadata samples as rows
2024-06-06 07:49:43.264767 INFO::Formula for fixed effects: expr ~  time

The output doesn’t make sense to me either, since every value is time instead of the actual timepoints (30, 480 etc.).

feature metadata value coef stderr N N.not.0 pval qval
L_plantarum_9_115__L_plantarum time time -0.0010509 4.17e-05 12 12 1.18e-09 3.15e-06
Le_mesenteroides_22_6__Le_mesenteroides time time -0.0011778 4.91e-05 12 12 1.82e-09 3.15e-06

I would be very thankful for help making this work.

Best regards,
Tom

Hi Tom,

For your first question, MTXmodel formats all the input files (i.e. feature data, taxon-RNA covariate data, metadata sample) into tables with sample ID as rows. In your case, MTXmodel first checked your feature file and metadata file, and converted metadata samples from columns to rows. Thus, in the following checking for covariate file and the formatted metadata, metadata samples were detected as rows.

For your second question, MTXmodel reports the outputs of continuous variables slightly different from category variables. Your metadata variable (i.e. time) was treated as a continuous variable. In this case, MTXmodel reported the statistical result from linear regression (i.e. regressing time-point values against feature values) in one row per feature, where the value column represents what continuous variable it is rather than each time point value.

Thanks!
Yancong

1 Like