Can I use the tool for taxon-RNA covariate analysis by providing as --input_dnadata a file where each feature/sample is assigned the TPM sum of the related taxon in the sample? If not, is there another source for the code that was used to produce the results for models 2 and 5?
You can use MTXmodel to calculate DE using taxon-RNA covariate as well. In this case, your RNA features and their corresponding source taxon should be matched. For example, you could format your input files as:
RNA feature file:
ID sample1 sample2 …
Gene1__Taxon1 value value
Gene2__Taxon1 value value
Gene3__Taxon2 value value
…
Taxon-RNA covariate file:
ID sample1 sample2 …
Taxon1 value value
Taxon2 value value
…
Thank you for the quick response! How would I feed this Taxon-RNA covariate file into the MTXmodel tool? The only argument that is documented for covariate input is --input_dnadata. And that one seems to require data on the level of individual gene IDs.
The taxon-RNA covariate is feeded to --input_dnadata. In this case, taxon-RNA acts as a proxy for taxon abundance when MGX data is unavailable, with taxon abundance offering an estimate of the gene copy number per feature, assuming consistent encoding.
once again, thank you for getting back to me so fast. I have preprocessed my dataset to create the inputs as you described. For testing, I am using only a subset of the data with a single metadata variable (time).
My count data file looks like this (prefixes being taxa names is just an artifact of the prior analysis pipeline)
ID
control_0_A
control_0_B
control_0_C
Pea_30_A
Pea_30_B
Pea_30_C
Pea_480_A
…
L_plantarum_1_2__L_plantarum
1341.0
1095.0
1024.0
15.0
25.0
24.0
279.0
…
L_plantarum_1_3__L_plantarum
441.0
379.0
363.0
8.0
6.0
6.0
80.0
…
L_plantarum_1_4__L_plantarum
204.0
172.0
142.0
5.0
4.0
5.0
35.0
…
…
…
…
…
…
…
…
…
…
The taxon-RNA covariate data fed as --input_dnadata is formatted like this:
Inspecting the log, it looks like detection of the input format is going wrong. metadata samples is columns, then rows.
2024-06-06 07:49:43.173591 INFO::Determining format of input files
2024-06-06 07:49:43.175754 INFO::Input format is data samples as columns and metadata samples as columns
2024-06-06 07:49:43.204545 INFO::Input format is dnadata samples as columns and metadata samples as rows
2024-06-06 07:49:43.264767 INFO::Formula for fixed effects: expr ~ time
The output doesn’t make sense to me either, since every value is time instead of the actual timepoints (30, 480 etc.).
feature
metadata
value
coef
stderr
N
N.not.0
pval
qval
L_plantarum_9_115__L_plantarum
time
time
-0.0010509
4.17e-05
12
12
1.18e-09
3.15e-06
Le_mesenteroides_22_6__Le_mesenteroides
time
time
-0.0011778
4.91e-05
12
12
1.82e-09
3.15e-06
…
…
…
…
…
…
…
…
…
I would be very thankful for help making this work.
For your first question, MTXmodel formats all the input files (i.e. feature data, taxon-RNA covariate data, metadata sample) into tables with sample ID as rows. In your case, MTXmodel first checked your feature file and metadata file, and converted metadata samples from columns to rows. Thus, in the following checking for covariate file and the formatted metadata, metadata samples were detected as rows.
For your second question, MTXmodel reports the outputs of continuous variables slightly different from category variables. Your metadata variable (i.e. time) was treated as a continuous variable. In this case, MTXmodel reported the statistical result from linear regression (i.e. regressing time-point values against feature values) in one row per feature, where the value column represents what continuous variable it is rather than each time point value.
thank you again for all your help. It took me some time to get back to this project. I think I have the tool working as it should now, but would be very thankful for some advise on the usage:
For the time series I’d expect expression of individual genes to go up and down over the course of the experiment. Thus, I want to avoid modeling a linear regression spanning >2 time points. I have split up the dataset and fed in the data in pairs (e.g. 3 samples from t0 and 3 samples from t30).
Can you give me a recommendation for the parameters in this setup? Will CSS + a negative binomial model combined with taxon covariates produce sensible results?
If you’d like to examine differential expression between time points, categorizing your time series and treating this categorized variable as a fixed effect should work (e.g. time_category: t0, t30).
According to our previous evaluation using synthetic data (PMID: 34784344), we observed inconsistent behavior and many false positive findings for existing negative binomials; LM-based approaches showed a good statistical power while controlling false discoveries. If you do want to see differences between models, you could try different models and see which fit best to your expectations.
I ran the different models before my last post and was surprised at how much the results differed. Thank you for the answer and the reference. I’ll just stick with TSS + LM + Taxon Covariate.
I have only used what you call ‘taxon-specific scaling’ plus DESeq in the past and originally wanted to try switching to the Feature-DNA covariate approach you suggest. But alas, I don’t have DNA count data for this current dataset. Hence the M3 approach.
Thank you again for your help & patience. As somebody who mainly does software development statistical analysis often feels like a minefield.