MTX Synthetic Identifying True Positives

Kat_Terwelp · April 15, 2025, 3:08pm

Hello! I’m interested in using the simulated mtx and mgx data from the MTX 2021 paper.

The null datasets are easy to use, since every feature should come back negative. However, I’d like to test the sensitivity of my pipelines using your *true simulations.

How do I know which feature groups should be identified as positive?

For example, the 2021 paper mentions that ~10% of transcripts should be associated with the phenotype in the true datasets. However, according to the spiked TSVs, nearly 99% of features were spiked in some simulations. I tried to parse the spike levels, but the data was a bit confusing to me. When summed across groups of simulations and feature groups, most of the spike totals center around 0.

Thanks for your help! I didn’t see this question asked before, but if it has been asked, please send me a link and I’ll close this topic.

franzosa · April 21, 2025, 6:11pm

For a given _abunds.tsv file, if there’s a corresponding _spiked.tsv file, then the latter contains a list of features in the former file that were spiked with a phenotype association along with a 1 or -1 sign (for direction). I was just checking a few of them and I was seeing 10x the rows in the abunds file in comparison with the corresponding spiked file, consistent with the 10% spike rate. Can you cite an example where you’re seeing 99%?
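As a rough illustration of the pairing described above, here is a minimal Python sketch that reads a spiked list, looks up which rows of the matching abundance table are true positives, and reports the spike rate. The column layout is an assumption on my part (tab-separated, feature ID first, sign second, no header in the spiked file, one header line in the abundance table), and the toy strings stand in for real files, so check it against the actual downloads before relying on it.

```python
import csv
import io


def read_spiked(handle):
    """Map each spiked feature to its direction (+1 or -1).

    Assumed layout: tab-separated, feature ID in column 1, sign in column 2.
    """
    spiked = {}
    for row in csv.reader(handle, delimiter="\t"):
        if len(row) < 2 or row[0].startswith("#"):
            continue  # skip blank, short, or comment lines
        spiked[row[0]] = int(float(row[1]))
    return spiked


def read_feature_ids(handle):
    """Return the first-column feature IDs of an abundance table.

    Assumed layout: one header line of sample IDs, then one row per feature.
    """
    rows = csv.reader(handle, delimiter="\t")
    next(rows, None)  # drop the assumed header line
    return [row[0] for row in rows if row]


# Toy stand-ins for a matching _abunds.tsv / _spiked.tsv pair (not real data).
abunds_text = "#\tS1\tS2\n" + "".join(
    f"feat{i:02d}\t{i}\t{i + 1}\n" for i in range(1, 11)
)
spiked_text = "feat04\t-1\n"

features = read_feature_ids(io.StringIO(abunds_text))
spiked = read_spiked(io.StringIO(spiked_text))

# Features present in the table that also appear in the spiked list.
positives = [f for f in features if f in spiked]
spike_rate = len(positives) / len(features)
print(positives, spike_rate)
```

With real files, the `io.StringIO(...)` calls would be replaced by `open(path, newline="")` handles for the two files of one simulation.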

I’m also not super clear on what you’re computing in the histograms. The answer there might help to explain some of the confusion around the files.

Kat_Terwelp · April 23, 2025, 2:08am

Hello Eric!

I realized my confusion. I was separating the spike files based on whether they contained “mgx”, “mtx”, or “bug” spikes but I didn’t realize there were also separate “spiked_groups” versions.

I ran regressions on the group_abund files, but I was using the non-group spikes as positive controls. This is what I plotted, and why I found ~99% of groups being spiked. If I compare the group_abund results against the spiked_groups files, I get the expected 10% spike rate.
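For anyone hitting the same mismatch, a small sanity check before scoring can catch it early. The sketch below is only an illustration: `score_calls` is a hypothetical helper, the IDs are made up, and it assumes you already have three sets of feature IDs (everything tested, the spiked list from the matching file, and whatever your pipeline called significant). It refuses to score when the spiked IDs are not all present in the tested table, which is one symptom of pairing a table with a spike list from a different feature level.

```python
def score_calls(called, spiked, tested):
    """Score pipeline calls against a spiked (true positive) list.

    All three arguments are iterables of feature IDs. Raises ValueError if
    any spiked ID is absent from the tested table, since that suggests the
    spike list and the abundance table do not belong together.
    """
    called, spiked, tested = set(called), set(spiked), set(tested)
    unknown = spiked - tested
    if unknown:
        raise ValueError(
            f"{len(unknown)} spiked IDs are not in the tested table; "
            "check that the spike list matches the abundance file"
        )
    called &= tested  # ignore calls on features that were never tested
    tp = len(called & spiked)
    fp = len(called - spiked)
    fn = len(spiked - called)
    return {
        "spike_rate": len(spiked) / len(tested) if tested else float("nan"),
        "tp": tp,
        "fp": fp,
        "fn": fn,
        "sensitivity": tp / len(spiked) if spiked else float("nan"),
        "precision": tp / len(called) if called else float("nan"),
    }


# Made-up example: 20 tested groups, 2 spiked, 2 called by the pipeline.
tested = {f"group{i:02d}" for i in range(1, 21)}
spiked = {"group03", "group11"}
called = {"group03", "group07"}

result = score_calls(called, spiked, tested)
print(result)
```

A spike rate far from the ~10% mentioned in the paper, or a `ValueError` from the ID check, is a hint to recheck which spike file is being used.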

Thank you for your response! I figured I was misunderstanding something since I wasn’t seeing the paper’s spike rate.

franzosa · April 23, 2025, 10:30pm

No worries - it took me a few minutes to reorient myself to the file structure and contents and I’m the one that made them… There are a lot of very similarly named files, so the confusion is understandable.

The “groups” were a separate thing where we spiked signals at the level of orthologous gene families (which might be present across multiple species) and then asked if we could recover their community-level differences (e.g. in a case where we’re not able to stratify by species).
