Should the calibration data contain raw, unprocessed counts?

I’m wondering what the best practice is for using a calibration file to generate synthetic real-like data.

Is it best to give SparseDOSSA the rawest form of the microbiome 16S count data? Or is it better to provide, say, a rarefied (i.e. sequencing depth evened out) and QC-filtered (e.g. low-prevalence species removed) version?

I imagine that if the rawest form of the data is provided, post-processing of the synthetic data might be necessary, and the synthetic data would also carry over any limitations of the calibration data. Hence, my suspicion is that a somewhat processed version might be the better option, but I’m not 100% sure which steps should be taken.

At the moment, I am providing a QC-filtered (low-prevalence features removed) but unrarefied, un-normalized dataset, meaning that sequencing depth is vastly uneven between samples. Could someone please advise? Thanks a bunch.

Edit: I have one additional question - when using a calibration file, should you expect the relative distribution of OTU features to be similar between the real and synthetic data? I’m finding that the mean abundance of the same OTU in the real data (calibration file) vs. the synthetic data is vastly different. I expected it to be similar.

Hi -

I’d say raw count data is preferable to rarefied. SparseDOSSA can model sequencing depth variation, so rarefying unnecessarily throws away information. QC-filtering would be helpful, though. If a feature has extremely low prevalence, SparseDOSSA will have difficulty estimating its mean abundance and variability.
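To make the "filter but don't rarefy" suggestion concrete, here is a minimal sketch in plain Python. It is not part of SparseDOSSA; the function names, the toy table, and the prevalence cutoff are all assumptions chosen for illustration, and a sensible cutoff will depend on your data.

```python
def prevalence(counts_row):
    """Fraction of samples in which a feature has a nonzero count."""
    return sum(1 for c in counts_row if c > 0) / len(counts_row)


def filter_low_prevalence(counts, min_prevalence=0.1):
    """Keep features detected in at least `min_prevalence` of samples.

    `counts` maps feature name -> list of raw counts, one per sample.
    The counts themselves are left untouched: no rarefaction and no
    normalization. The 0.1 default is an arbitrary illustrative cutoff.
    """
    return {
        feature: row
        for feature, row in counts.items()
        if prevalence(row) >= min_prevalence
    }


# Toy raw count table (made up): 3 features x 5 samples, uneven depth.
raw_counts = {
    "OTU_1": [120, 0, 3400, 15, 980],
    "OTU_2": [0, 0, 0, 7, 0],
    "OTU_3": [45, 2, 0, 0, 310],
}

filtered = filter_low_prevalence(raw_counts, min_prevalence=0.5)
print(sorted(filtered))
```

The filtered table still holds the original raw counts, so the uneven sequencing depth is preserved for the model to learn from.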

To your last question, on why OTUs in the real vs. synthetic data have different distributions: SparseDOSSA currently simulates “new” microbial features rather than the ones included in the calibration file. These new features are designed to be distributed similarly to the original ones (for example, the distribution of per-feature mean relative abundance across simulated features should be similar to that in the real data). Because of this, it is difficult to compare real vs. simulated data on a per-feature basis. You could rank features in each data type by, say, their mean abundance, but even then they are not guaranteed, by design, to match.
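As a rough illustration of comparing the two datasets at the distribution level rather than feature by feature, here is a small plain-Python sketch. The helper names and both toy tables are made up for this example (the "synthetic" table is not SparseDOSSA output); it only shows the idea of ranking per-feature mean relative abundances in each dataset and looking at them side by side.

```python
def relative_abundance(counts):
    """Convert a feature -> per-sample counts table into per-sample proportions."""
    n_samples = len(next(iter(counts.values())))
    depths = [sum(row[j] for row in counts.values()) for j in range(n_samples)]
    return {
        feature: [
            row[j] / depths[j] if depths[j] > 0 else 0.0
            for j in range(n_samples)
        ]
        for feature, row in counts.items()
    }


def ranked_mean_abundance(counts):
    """Per-feature mean relative abundance, sorted from largest to smallest."""
    rel = relative_abundance(counts)
    means = [sum(values) / len(values) for values in rel.values()]
    return sorted(means, reverse=True)


# Toy tables (made up). Feature names intentionally differ between the two,
# since simulated features do not correspond one-to-one to the real ones.
real = {
    "OTU_a": [90, 40, 10],
    "OTU_b": [10, 50, 5],
    "OTU_c": [0, 10, 85],
}
synthetic = {
    "Feature1": [5, 200, 30],
    "Feature2": [150, 20, 60],
    "Feature3": [45, 180, 10],
}

real_ranked = ranked_mean_abundance(real)
synth_ranked = ranked_mean_abundance(synthetic)

# Compare by rank, not by name.
for rank, (r, s) in enumerate(zip(real_ranked, synth_ranked), start=1):
    print(f"rank {rank}: real={r:.3f} synthetic={s:.3f}")
```

With real data you would probably compare the two ranked lists with a plot or summary quantiles; exact agreement at each rank is not expected.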

Hope this is helpful,

I am interested in using SparseDOSSA2 for modeling the abundance counts of the infant gut microbiome. I’m thinking of using this data set ( to model the distribution of counts.

What function would we use to do that? Does the file get read into the spike_metadata parameter in the SparseDOSSA2 function? Also, are there any examples to follow? Thanks in advance for your help!

Hi - Apologies, but we haven’t finished SparseDOSSA2 just yet. A development version should come out in the very near future though, so do stay tuned! :slight_smile: