Inquiry about usage of the sparseDOSSA package

I am working on microbiome differential abundance (DA) analysis, and a reviewer of my recently submitted paper advised me to use the sparseDOSSA package to benchmark different DA analysis methods, including mine.

However, I am finding it confusing to use your package to achieve my simulation goals. Basically, I want to simulate data in which a portion of the taxa are DA between two conditions. In my case, the most basic setup is that, for example, 10% of the total taxa are DA between the experiment group and the control group. An indicator vector of “group1” or “group2” indicates which group each sample belongs to. I found that the sparseDOSSA package generates at least 4 covariates, and I am not sure how to make the DA taxa relevant only to the binary feature.

Could you please show me how to set the parameters so that 10% of the taxa in the simulated data are DA between the two conditions? Also, could you show me how to identify which taxa are truly DA between these two conditions?


First, please make sure you have the most recent version of SparseDOSSA installed: devtools::install_github("biobakery/sparseDOSSA"). You may need to install devtools first if it is not already available.

Specifying that 10% of taxa are associated is straightforward: just set percent_spiked = 0.1 in your function call.

By default, SparseDOSSA simulates metadata columns with both continuous and binary distributions. To bypass this, provide your own metadata as a matrix (rows are metadata variables, columns are samples) and pass it to the UserMetadata parameter in the function call.

So here is an example of using SparseDOSSA to simulate an OTU table with 100 features and 100 samples, with 10% of the features associated with a binary metadata variable (~50% cases, 50% controls):

library(sparseDOSSA)
# create metadata matrix: one row of 100 per-sample 0/1 values indicating case/control
metadata <- matrix(rbinom(n = 100, size = 1, prob = 0.5), nrow = 1, ncol = 100)
# run sparseDOSSA with the user-provided metadata
simulated_data <- sparseDOSSA(number_features = 100, number_samples = 100, percent_spiked = 0.1, UserMetadata = metadata)
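If you would rather have exactly 50 cases and 50 controls instead of the roughly 50% split that rbinom gives, a minimal base R sketch is below. The names n_samples and group are only illustrative and are not part of SparseDOSSA; the matrix keeps the same one-row, samples-as-columns layout as above.

```r
set.seed(42)  # only so the shuffle is reproducible
n_samples <- 100
# exactly half controls (0) and half cases (1), in random order
group <- sample(rep(c(0, 1), each = n_samples / 2))
# one metadata row, one column per sample
metadata <- matrix(group, nrow = 1, ncol = n_samples)
table(metadata[1, ])
```

This matrix can then be passed as UserMetadata in the same way as the rbinom version.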

The simulation result, simulated_data, is a list. simulated_data$OTU_count[[1]] and simulated_data$OTU_norm[[1]] are the simulated count and relative abundance tables, respectively (rows are features and columns are samples). These tables also contain additional rows. Specifically, the rows are:

  • First row is sample ID
  • The next n_metadata rows are the metadata values, one row for each simulated or user-provided metadata variable. In our case we’d have exactly one row, corresponding to the provided binary variable.
  • The next n_feature rows are the “null” feature values, one row for each feature (100 in our case). These are the simulated log-normal microbial abundances, without any additional spiking-in.
  • The next n_feature rows are the “null” feature values, but with outliers added in.
  • Finally, the last n_feature rows are the final output. These are simulated feature abundances with spiked-in metadata association.

This information is indicated by the row labels in the first column of the table. Using our run as an example, simulated_data$OTU_count[[1]] and simulated_data$OTU_norm[[1]] will both be matrices with 302 rows and 101 columns (1 sample ID row + 1 metadata row + 3 × 100 feature rows; 1 label column + 100 sample columns). The first column and first row hold the feature and sample names. The last 100 rows are the simulated metadata-associated microbial abundances.
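To pull out just the final block of spiked abundances, something like the sketch below should work. It does not call SparseDOSSA; instead it builds a small mock table with the layout described above (label column first, then a sample ID row, the metadata rows, and three blocks of feature rows). The row labels in the mock are made up, and I am assuming the table is stored as character because it mixes labels and values. With real output you would use simulated_data$OTU_count[[1]] in place of mock_table and set n_feature to 100.

```r
# Mock table mimicking the described layout (all labels are hypothetical):
# 1 sample-ID row + n_metadata rows + 3 blocks of n_feature rows.
n_feature <- 5
n_sample <- 4
n_metadata <- 1
n_row <- 1 + n_metadata + 3 * n_feature
values <- matrix(as.character(seq_len(n_row * n_sample)),
                 nrow = n_row, ncol = n_sample)
labels <- c("SampleID", "Metadata1",
            paste0("Null_", 1:n_feature),
            paste0("Outlier_", 1:n_feature),
            paste0("Spike_", 1:n_feature))
mock_table <- cbind(labels, values)

# Keep the last n_feature rows and drop the label column.
last_rows <- (nrow(mock_table) - n_feature + 1):nrow(mock_table)
spiked <- mock_table[last_rows, -1, drop = FALSE]

# Convert the character entries to numeric while keeping the shape,
# and carry the row labels over as row names.
spiked_num <- matrix(as.numeric(spiked), nrow = nrow(spiked),
                     dimnames = list(mock_table[last_rows, 1], NULL))
dim(spiked_num)
```

Indexing from the end of the table avoids having to count how many metadata rows precede the feature blocks.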

simulated_data$truth[[1]] holds additional information about the spiking-in. Importantly, this includes the names of the features that are actually associated with the metadata (true positives). Again, the contents are fairly self-explanatory, and I’d suggest taking a look at the object to understand it better. Specifically, the names of these features can be extracted via grep("Feature_spike_n_", simulated_data$truth[[1]][, 1], fixed = TRUE, value = TRUE).
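For benchmarking you will probably want a TRUE/FALSE label per feature, so you can compute sensitivity and false discovery rate for each method. A rough sketch is below. It uses a made-up truth column because the exact entries depend on your run: the strings in mock_truth and all_features are hypothetical, and only the Feature_spike_n_ prefix comes from the grep call above.

```r
# Hypothetical stand-in for simulated_data$truth[[1]][, 1]
mock_truth <- c("Metadata: Metadata1",
                "Feature_spike_n_3",
                "Feature_spike_n_7",
                "Some other entry")

# Same grep call as above: keep entries containing the literal prefix
spiked_names <- grep("Feature_spike_n_", mock_truth,
                     fixed = TRUE, value = TRUE)

# Hypothetical names for all features in the final block of the table
all_features <- paste0("Feature_spike_n_", 1:10)

# TRUE for features listed as truly associated, FALSE otherwise
is_da <- all_features %in% spiked_names
sum(is_da)
```

If the truth entries in your output carry extra text around the feature name, you would need to strip it before the exact match with %in%.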



Thank you so much for the detailed reply. I will try your suggested approaches.


Dear Siyuan,

When I run the code you provided, it results in an error.

simulated_data <- sparseDOSSA(number_features = 100, number_samples = 100, percent_spiked = 0.1, UserMetadata = metadata) :
Error in sparseDOSSA(number_features = 100, number_samples = 100, percent_spiked = 0.1,  : 
  unused argument (UserMetadata = metadata)

When I check the documentation, there is also no parameter called UserMetadata in the Usage section of the function; it appears only in the Arguments section.

R.version: 3.6.3
sparseDOSSA version: 0.99.6
Could you please check on this?

Hi -
Did you install SparseDOSSA through Bioconductor? The version with functionality to specify metadata actually lives on GitHub. Could you follow the instructions above to install this version? If it still doesn’t work let me know!
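A quick way to check which version you ended up with is to ask R whether the installed function has a UserMetadata argument. This is just a sketch in base R; has_argument is a small helper I made up for illustration and is not part of SparseDOSSA.

```r
# TRUE if function `fn` has a formal argument called `arg`
has_argument <- function(fn, arg) {
  arg %in% names(formals(fn))
}

if (requireNamespace("sparseDOSSA", quietly = TRUE)) {
  # installed version, and whether it accepts user-provided metadata
  print(packageVersion("sparseDOSSA"))
  print(has_argument(sparseDOSSA::sparseDOSSA, "UserMetadata"))
} else {
  message("sparseDOSSA is not installed")
}
```

If the second value printed is FALSE, the installed copy does not accept UserMetadata and the call will fail with the "unused argument" error.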


I have installed SparseDOSSA with your command, but got the following warning message while installing the package:

Downloading GitHub repo biobakery/sparseDOSSA@master
biobakery-sparseDOSSA-7fe4576/ Can't create '\\\\?\\C:\\Users\\ruoch\\AppData\\Local\\Temp\\Rtmpojr5M2\\remotes49582d57251\\biobakery-sparseDOSSA-7fe4576\\'
tar.exe: Error exit delayed from previous errors.

This time the function runs, but with this warning message:

> simulated_data <- sparseDOSSA(number_features = 100, number_samples = 100, percent_spiked = 0.1, UserMetadata = metadata)
Parameters BEFORE Calibration File
Length exp NA Length vdMu NA length vdSD NA length vdPercentZero NA Read depth 8030
Parameters AFTER Calibration File (if no calibration file is used, defaults are shown)
Length exp 1 Length vdMu 1 length vdSD 1 length vdPercentZero 1 Read depth 8030 Feature Count 100
func_generate_random_lognormal_matrix START
func_generate_random_lognormal_matrix: START Making features
stop func_generate_random_lognormal_matrix
start func_generate_random_lognormal_with_outliers
Stop func_generate_random_lognormal_with_outliers
start func_generate_random_lognormal_with_multivariate_spikes
stop func_generate_random_lognormal_with_multivariate_spikes
Warning message:
In sparseDOSSA(number_features = 100, number_samples = 100, percent_spiked = 0.1,  :
  number of associations = 0, and no spike file specified; no bug-bug spike-ins will be done.

Does this mean no spike-ins will be introduced?



I haven’t seen that installation message before, but it seems to be related to the readme file and shouldn’t affect the package itself. The warning about no spike-ins is fine: SparseDOSSA can also spike in associations between pairs of bugs (feature-feature associations), and the warning only says that this was not done. The feature-metadata spike-ins are unaffected. If you want to make sure, a good sanity check is to take the true positive features and see whether they are actually associated with the metadata.
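One possible way to run that sanity check is a per-feature Wilcoxon rank-sum test between cases and controls. The test is my own choice for illustration; any two-group test would do. The sketch below runs on mock data so that it works without SparseDOSSA: abund, group and is_da are stand-ins for the final feature block, the metadata row, and the truth labels from your run, and the effect size of 2 on the log scale is arbitrary.

```r
set.seed(1)
n_sample <- 100
n_feature <- 100
group <- rep(c(0, 1), each = n_sample / 2)   # mock case/control labels
is_da <- seq_len(n_feature) <= 10            # mock truth: first 10 features spiked

# Mock log-normal abundances; spiked features get a shift in the cases.
log_abund <- matrix(rnorm(n_feature * n_sample), nrow = n_feature)
log_abund[is_da, group == 1] <- log_abund[is_da, group == 1] + 2
abund <- exp(log_abund)

# Wilcoxon rank-sum test per feature, cases vs controls
pvals <- apply(abund, 1, function(x) {
  wilcox.test(x[group == 1], x[group == 0])$p.value
})

summary(pvals[is_da])    # truly associated features: expect small p-values
summary(pvals[!is_da])   # null features: expect p-values spread over [0, 1]
```

With real output, clearly smaller p-values among the truly associated features than among the rest would indicate that the metadata spike-ins were applied.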