I’m wondering what the best practice is for using a calibration file to generate synthetic real-like data.
Is it best to provide sparseDOSSA with the rawest form of the microbiome 16S count data? Or is it better to provide, say, a rarefied (i.e. sequencing depth evened out across samples) and QC-filtered (e.g. low-prevalence species removed) version?
I imagine that if the rawest form of the data is provided, post-processing of the synthetic data might be necessary, and the synthetic data would also carry over any limitations of the calibration data. Hence, my suspicion is that a somewhat processed version might be the better choice, but I’m not 100% sure which processing steps should be applied.
At the moment, I am providing a QC-filtered (low-prevalence features removed) but unrarefied, un-normalized dataset, meaning sequencing depth is vastly uneven between samples. Could someone please advise? Thanks a bunch.
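To make the setup concrete, here is a minimal sketch of that kind of preprocessing: a prevalence filter followed by a look at per-sample sequencing depth. It assumes a samples-by-OTUs count table stored as nested lists. The function names, the toy counts, and the 50% prevalence threshold are all illustrative assumptions, not values from the actual dataset and not part of sparseDOSSA.

```python
def prevalence_filter(counts, min_prevalence=0.1):
    """Keep OTUs (columns) detected (count > 0) in at least
    `min_prevalence` of the samples (rows).

    Returns the filtered table and the indices of the kept OTUs.
    """
    n_samples = len(counts)
    n_otus = len(counts[0])
    keep = [
        j for j in range(n_otus)
        if sum(1 for row in counts if row[j] > 0) / n_samples >= min_prevalence
    ]
    filtered = [[row[j] for j in keep] for row in counts]
    return filtered, keep


def sequencing_depths(counts):
    """Total read count per sample (row sums)."""
    return [sum(row) for row in counts]


# Toy table: 4 samples x 4 OTUs, with deliberately uneven depth (assumed data).
counts = [
    [120, 0, 30, 0],
    [4000, 0, 900, 1],
    [15, 0, 5, 0],
    [60000, 2, 11000, 0],
]

filtered, kept = prevalence_filter(counts, min_prevalence=0.5)
depths = sequencing_depths(filtered)

print("kept OTU indices:", kept)
print("depths after filtering:", depths)
print("max/min depth ratio:", max(depths) / min(depths))
```

A large max/min depth ratio after filtering is what "vastly uneven" means here: the table is filtered but still neither rarefied nor normalized.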
Edit: I have one additional question: when using a calibration file, should you expect the relative distribution of OTU features to be similar between the real and synthetic data? I’m finding that the mean abundance of the same OTU is vastly different between the real (calibration file) and synthetic data. I expected them to be similar.
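One thing worth ruling out when making this comparison is whether the means are computed on raw counts or on relative abundances. The sketch below, using made-up toy tables and hypothetical helper names, shows that two datasets with identical composition but different sequencing depth can have very different mean raw counts per OTU while their mean relative abundances agree. It only illustrates the comparison itself; it does not claim this is the cause of the discrepancy, and it says nothing about how sparseDOSSA generates counts.

```python
def to_relative_abundance(counts):
    """Total-sum scaling: divide each sample (row) by its read total."""
    rel = []
    for row in counts:
        total = sum(row)
        rel.append([c / total if total > 0 else 0.0 for c in row])
    return rel


def mean_per_otu(table):
    """Mean of each OTU (column) across samples (rows)."""
    n_samples = len(table)
    return [
        sum(row[j] for row in table) / n_samples
        for j in range(len(table[0]))
    ]


# Toy samples-by-OTUs tables (assumed data): same 90/10 composition,
# but the "real" table is sequenced much more deeply.
real = [[90, 10], [9000, 1000]]
synthetic = [[450, 50], [45, 5]]

raw_real = mean_per_otu(real)
raw_synth = mean_per_otu(synthetic)
rel_real = mean_per_otu(to_relative_abundance(real))
rel_synth = mean_per_otu(to_relative_abundance(synthetic))

print("mean raw counts, real vs. synthetic:", raw_real, raw_synth)
print("mean relative abundance, real vs. synthetic:", rel_real, rel_synth)
```

If the means still differ substantially after both tables are converted to relative abundance, then depth alone does not explain the gap.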