Taxon-covariate in MTXmodel

ttubb · May 31, 2024, 2:08pm

I’d like to test out taxon-(RNA) covariate analysis as described in Statistical approaches for differential expression analysis in metatranscriptomics | Bioinformatics | Oxford Academic
The publication directs me to GitHub - biobakery/MTX_model: R package for differential expression analysis in metatranscriptomics for the “analysis code”. However, the repository / tool seems to be aimed only at feature-DNA covariate analysis.

Can I use the tool for taxon-RNA covariate analysis by providing as --input_dnadata a file where each feature/sample is assigned the TPM sum of the related taxon in the sample? If not, is there another source for the code that was used to produce the results for models 2 and 5?

Best regards,
Tom

YancongZhang · May 31, 2024, 3:35pm

Hi Tom,

You can use MTXmodel to calculate DE using taxon-RNA covariate as well. In this case, your RNA features and their corresponding source taxon should be matched. For example, you could format your input files as:

RNA feature file:

ID sample1 sample2 …
Gene1__Taxon1 value value
Gene2__Taxon1 value value
Gene3__Taxon2 value value
…

Taxon-RNA covariate file:

ID sample1 sample2 …
Taxon1 value value
Taxon2 value value
…

Thanks!
Yancong

ttubb · June 3, 2024, 10:33am

Hi Yancong,

Thank you for the quick response! How would I feed this Taxon-RNA covariate file into the MTXmodel tool? The only argument that is documented for covariate input is --input_dnadata. And that one seems to require data on the level of individual gene IDs.

Regards,
Tom

YancongZhang · June 3, 2024, 12:44pm

Hi Tom,

The taxon-RNA covariate is feeded to --input_dnadata. In this case, taxon-RNA acts as a proxy for taxon abundance when MGX data is unavailable, with taxon abundance offering an estimate of the gene copy number per feature, assuming consistent encoding.

Thanks!
Yancong

ttubb · June 6, 2024, 8:25am

Hi Yancong,

once again, thank you for getting back to me so fast. I have preprocessed my dataset to create the inputs as you described. For testing, I am using only a subset of the data with a single metadata variable (time).

My count data file looks like this (prefixes being taxa names is just an artifact of the prior analysis pipeline)

ID	control_0_A	control_0_B	control_0_C	Pea_30_A	Pea_30_B	Pea_30_C	Pea_480_A	…
L_plantarum_1_2__L_plantarum	1341.0	1095.0	1024.0	15.0	25.0	24.0	279.0	…
L_plantarum_1_3__L_plantarum	441.0	379.0	363.0	8.0	6.0	6.0	80.0	…
L_plantarum_1_4__L_plantarum	204.0	172.0	142.0	5.0	4.0	5.0	35.0	…
…	…	…	…	…	…	…	…	…

The taxon-RNA covariate data fed as --input_dnadata is formatted like this:

ID	control_0_A	control_0_B	control_0_C	Pea_30_A	Pea_30_B	Pea_30_C	Pea_480_A	…
L_plantarum	3111281.0	2526283.0	2359353.0	107508.0	169824.0	157053.0	482013.0	…
Le_mesenteroides	502052.0	378775.0	352994.0	48116.0	104420.0	106500.0	327485.0	…
…	…	…	…	…	…	…	…	…

metadata looks like this:

ID	control_0_A	control_0_B	control_0_C	Pea_30_A	Pea_30_B	Pea_30_C	Pea_480_A	…
time	0	0	0	30	30	30	480	…

I run the tool as follows:

MTXmodel.R \
        {counts_path} \
        {metadata_path} \
        {output_dir} \
        --min_abundance 0 \
        --min_prevalence 0.0 \
        --max_significance 0.25 \
        --min_variance 0.0 \
        --correction BH \
        --standardize FALSE \
        --normalization TSS \
        --transform LOG \
        --analysis_method LM \
        --cores 1 \
        --fixed_effects time \
        --reference time,30 \
        --rna_dna_flt lenient\
        --input_dnadata {covariate_path}

Inspecting the log, it looks like detection of the input format is going wrong. metadata samples is columns, then rows.

2024-06-06 07:49:43.173591 INFO::Determining format of input files
2024-06-06 07:49:43.175754 INFO::Input format is data samples as columns and metadata samples as columns
2024-06-06 07:49:43.204545 INFO::Input format is dnadata samples as columns and metadata samples as rows
2024-06-06 07:49:43.264767 INFO::Formula for fixed effects: expr ~  time

The output doesn’t make sense to me either, since every value is time instead of the actual timepoints (30, 480 etc.).

feature	metadata	value	coef	stderr	N	N.not.0	pval	qval
L_plantarum_9_115__L_plantarum	time	time	-0.0010509	4.17e-05	12	12	1.18e-09	3.15e-06
Le_mesenteroides_22_6__Le_mesenteroides	time	time	-0.0011778	4.91e-05	12	12	1.82e-09	3.15e-06
…	…	…	…	…	…	…	…	…

I would be very thankful for help making this work.

Best regards,
Tom

YancongZhang · June 6, 2024, 5:23pm

Hi Tom,

For your first question, MTXmodel formats all the input files (i.e. feature data, taxon-RNA covariate data, metadata sample) into tables with sample ID as rows. In your case, MTXmodel first checked your feature file and metadata file, and converted metadata samples from columns to rows. Thus, in the following checking for covariate file and the formatted metadata, metadata samples were detected as rows.

For your second question, MTXmodel reports the outputs of continuous variables slightly different from category variables. Your metadata variable (i.e. time) was treated as a continuous variable. In this case, MTXmodel reported the statistical result from linear regression (i.e. regressing time-point values against feature values) in one row per feature, where the value column represents what continuous variable it is rather than each time point value.

Thanks!
Yancong

ttubb · July 28, 2024, 8:31am

Hi Yancong,

thank you again for all your help. It took me some time to get back to this project. I think I have the tool working as it should now, but would be very thankful for some advise on the usage:

For the time series I’d expect expression of individual genes to go up and down over the course of the experiment. Thus, I want to avoid modeling a linear regression spanning >2 time points. I have split up the dataset and fed in the data in pairs (e.g. 3 samples from t0 and 3 samples from t30).

Can you give me a recommendation for the parameters in this setup? Will CSS + a negative binomial model combined with taxon covariates produce sensible results?

Best regards,
Tom

YancongZhang · July 29, 2024, 8:41pm

Hi Tom,

If you’d like to examine differential expression between time points, categorizing your time series and treating this categorized variable as a fixed effect should work (e.g. time_category: t0, t30).

According to our previous evaluation using synthetic data (PMID: 34784344), we observed inconsistent behavior and many false positive findings for existing negative binomials; LM-based approaches showed a good statistical power while controlling false discoveries. If you do want to see differences between models, you could try different models and see which fit best to your expectations.

Thanks!
Yancong

ttubb · July 30, 2024, 7:43am

Hi Yancong,

I ran the different models before my last post and was surprised at how much the results differed. Thank you for the answer and the reference. I’ll just stick with TSS + LM + Taxon Covariate.

I have only used what you call ‘taxon-specific scaling’ plus DESeq in the past and originally wanted to try switching to the Feature-DNA covariate approach you suggest. But alas, I don’t have DNA count data for this current dataset. Hence the M3 approach.

Thank you again for your help & patience. As somebody who mainly does software development statistical analysis often feels like a minefield.

Cheers,
Tom

Topic		Replies	Views
RNA and DNA CPM as covariates MaAsLin	7	961	February 3, 2021
MTXmodel / Maaslin2 - Max feature number? Downstream analysis and statistics	6	170	June 27, 2024
Accessing the MGX-normalized MTX feature expression data in MTX_model? bioBakery workflows	1	149	March 5, 2024
P-value adjustment in MTX_model MaAsLin	1	113	March 29, 2024
MTX Model Coefficient to Log Fold Change bioBakery workflows	0	74	June 10, 2024

Taxon-covariate in MTXmodel

Related topics