Hello! I’m interested in using the simulated mtx and mgx data from the MTX 2021 paper.
The null datasets are easy to use since every feature should be a true negative. However, I’d like to test the sensitivity of my pipelines using your “true” simulations.
How do I know which feature groups should be identified as positive?
For example, in the 2021 paper it mentions that ~10% of transcripts should be associated with the phenotype for true datasets. However, I found nearly 99% of features were spiked for some simulations according to the spiked tsvs. I tried to parse the spiked levels but the data was a bit confusing to me. When summed across groups of simulations + feature groups, most of the spikes center around 0 totals.
Thanks for your help! I didn’t see this question asked before, but if it has been asked, please send me a link and I’ll close this topic.
For a given _abunds.tsv file, if there’s a corresponding _spiked.tsv file, then the latter contains a list of features in the former file that were spiked with a phenotype association along with a 1 or -1 sign (for direction). I was just checking a few of them and I was seeing 10x the rows in the abunds file in comparison with the corresponding spiked file, consistent with the 10% spike rate. Can you cite an example where you’re seeing 99%?
I’m also not super clear on what you’re computing in the histograms? The answer there might help to explain some of the confusion around the files.
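In case it helps to reproduce the row-count check described above, here is a minimal Python sketch. It assumes both files are tab-separated with the feature ID in the first column and that any header line starts with "#". Those layout details, and the function names `read_feature_ids` and `spike_rate`, are assumptions for illustration rather than a description of the actual files, and the toy tables are made up.

```python
import csv
import io


def read_feature_ids(text):
    """Return first-column IDs from tab-separated text.

    Assumption: blank lines and lines starting with '#' are not features.
    """
    ids = []
    for row in csv.reader(io.StringIO(text), delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue
        ids.append(row[0])
    return ids


def spike_rate(abunds_text, spiked_text):
    """Return (fraction of abundance features that were spiked, unmatched spike IDs)."""
    features = set(read_feature_ids(abunds_text))
    spiked = set(read_feature_ids(spiked_text))
    if not features:
        raise ValueError("abundance table contains no features")
    # Spike IDs that are absent from the abundance table suggest the two
    # files do not belong together.
    unmatched = spiked - features
    return len(spiked & features) / len(features), unmatched


# Toy stand-ins for an abundance table and its spike list (made-up data).
abunds = "#feature\ts1\ts2\n" + "".join(f"f{i}\t1\t2\n" for i in range(10))
spiked = "f3\t1\n"

rate, unmatched = spike_rate(abunds, spiked)
print(rate, unmatched)
```

For real files, the text can be read with `open(path).read()`. A rate far from 10%, or a non-empty set of unmatched IDs, would suggest that the spike list and the abundance table are not a matching pair.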
Hello Eric!
I realized my confusion. I was separating the spike files based on whether they contained “mgx”, “mtx”, or “bug” spikes but I didn’t realize there were also separate “spiked_groups” versions.
I ran regressions on the group_abund files, but I was using the non-group spikes for positive controls. This is what I plotted and why I found ~99% of groups being spiked. If I compare the results between group and spiked_groups, I get the expected 10% spike rate.
Thank you for your response! I figured I was misunderstanding something since I wasn’t seeing the paper’s spike rate.
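For anyone else using these files as positive controls, the comparison I ended up with can be sketched roughly as follows. This is a hedged illustration, not the paper's evaluation code: `score_calls` is a hypothetical helper, the feature names are made up, and `called` stands for whatever set of features a pipeline reports as significant. The important part is that `spiked` comes from the spike list matching the abundance table that was tested (spiked_groups for group abundances).

```python
def score_calls(all_features, spiked, called):
    """Compare a pipeline's significant features against the spiked set."""
    all_features, spiked, called = set(all_features), set(spiked), set(called)
    tp = len(called & spiked)   # spiked and recovered
    fp = len(called - spiked)   # reported but never spiked
    fn = len(spiked - called)   # spiked but missed
    tn = len(all_features - spiked - called)
    return {
        "tp": tp,
        "fp": fp,
        "fn": fn,
        "tn": tn,
        # Sensitivity is undefined when nothing was spiked (null datasets).
        "sensitivity": tp / (tp + fn) if (tp + fn) else float("nan"),
        # Observed false discovery proportion among the reported features.
        "fdr": fp / (tp + fp) if (tp + fp) else 0.0,
    }


# Made-up example: 10 groups, 1 spiked, and a pipeline that reports 2.
groups = [f"g{i}" for i in range(1, 11)]
result = score_calls(groups, spiked={"g1"}, called={"g1", "g2"})
print(result)
```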
No worries - it took me a few minutes to reorient myself to the file structure and contents and I’m the one that made them…
There are a lot of very similarly named files, so the confusion is understandable.
The “groups” were a separate thing where we spiked signals at the level of orthologous gene families (which might be present across multiple species) and then asked if we could recover their community-level differences (e.g. in a case where we’re not able to stratify by species).