Hi,
First, please make sure you have the latest version of SparseDOSSA installed: devtools::install_github("biobakery/sparseDOSSA"). You might need to install devtools first if it is not already available.
Specifying that 10% of the taxa are associated is straightforward - just set percent_spiked = 0.1 in your function call.
SparseDOSSA by default simulates metadata columns with both continuous and binary distributions. To bypass this, you have to provide your own metadata as a matrix (rows are metadata variables, columns are samples). This matrix can then be passed to the UserMetadata parameter in the function call.
Here is an example of using SparseDOSSA to simulate an OTU table with 100 features and 100 samples, with 10% of the features associated with a binary metadata variable (~50% cases, ~50% controls):
library(sparseDOSSA)
# create metadata matrix, one row of 100 0/1 per-sample values indicating case/control
metadata <- matrix(rbinom(n = 100, size = 1, prob = 0.5), nrow = 1, ncol = 100)
# run sparseDOSSA
simulated_data <- sparseDOSSA(number_features = 100, number_samples = 100, percent_spiked = 0.1, UserMetadata = metadata)
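Note that rbinom only gives a 50/50 split on average. If you want exactly 50 cases and 50 controls, one option is to shuffle a fixed vector instead. This is just a base R sketch; metadata_balanced and the seed are names/values I picked for illustration, and the matrix would be passed as UserMetadata in the same way as above.

```r
# Sketch: an exactly balanced case/control metadata row (50 cases, 50 controls)
set.seed(1)  # arbitrary seed, only so the shuffle is reproducible
n_samples <- 100
# 50 zeros and 50 ones, shuffled so the case/control order is random
balanced <- sample(rep(c(0, 1), each = n_samples / 2))
# same shape as before: one metadata row, one column per sample
metadata_balanced <- matrix(balanced, nrow = 1, ncol = n_samples)
table(metadata_balanced)
```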
The simulated result, simulated_data, is a list. simulated_data$OTU_count[[1]] and simulated_data$OTU_norm[[1]] correspond to the simulated count and relative abundance tables, respectively (rows are features and columns are samples). There are additional rows in these tables. Specifically, the rows are:
- The first row is the sample IDs.
- The next n_metadata rows are the metadata values, one row for each simulated or user-provided metadata variable. In our case we’d have exactly one row, corresponding to the provided binary variable.
- The next n_feature rows are the “null” feature values, one row for each feature (100 in our case). These are the simulated log-normal microbial abundances, without any additional spiking-in.
- The next n_feature rows are the “null” feature values, but with outliers added in.
- Finally, the last n_feature rows are the final output. These are the simulated feature abundances with spiked-in metadata association.
This information is fairly self-explanatory from the first column of the table. Using our run as an example, simulated_data$OTU_count[[1]] and simulated_data$OTU_norm[[1]] will both be matrices with 302 rows (1 sample ID row + 1 metadata row + 3 x 100 feature rows) and 101 columns. The first column and first row hold the feature and sample names. The last 100 rows correspond to the simulated metadata-associated microbial abundances.
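To make the row layout concrete, here is a rough sketch of pulling out just the last n_feature rows as a numeric matrix. It runs on a mock table that I build by hand with the layout described above; the row labels and values are placeholders, not real sparseDOSSA output, and I am assuming the table comes back as a character matrix since it mixes names and numbers. For a real run you would replace mock with simulated_data$OTU_count[[1]].

```r
# Mock table with the layout described above; labels and values are
# placeholders, NOT real sparseDOSSA output
n_metadata <- 1
n_feature <- 100
n_sample <- 100
n_row <- 1 + n_metadata + 3 * n_feature             # 302 rows in total
mock <- matrix("0", nrow = n_row, ncol = 1 + n_sample)
mock[, 1] <- paste0("row_", seq_len(n_row))         # first column: row names
mock[1, -1] <- paste0("Sample", seq_len(n_sample))  # first row: sample IDs

# the spiked-in features are the last n_feature rows
spiked_rows <- (n_row - n_feature + 1):n_row
# drop the name column and convert the remaining entries to numbers
spiked <- matrix(as.numeric(mock[spiked_rows, -1]), nrow = n_feature)
rownames(spiked) <- mock[spiked_rows, 1]
colnames(spiked) <- mock[1, -1]
dim(spiked)
```

The same index arithmetic gives the other blocks, e.g. rows 3 to 102 would be the “null” features in this 302-row example.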
simulated_data$truth[[1]] has additional information about the spiking-in. Importantly, this includes the names of the features that are actually associated with the metadata (the true positives). Again, the information is fairly self-explanatory and I’d suggest taking a look at it to understand it better. Specifically, the names of these features can be extracted via grep("Feature_spike_n_", simulated_data$truth[[1]][, 1], fixed = TRUE, value = TRUE).
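In case the grep arguments are unfamiliar: fixed = TRUE matches the pattern as a literal string rather than a regular expression, and value = TRUE returns the matching entries themselves instead of their positions. A tiny sketch on a made-up vector (the entries below are invented purely for illustration; the real first column of the truth table will look different):

```r
# Made-up stand-in for simulated_data$truth[[1]][, 1]; real entries will differ
truth_col <- c("Metadata_1", "Feature_spike_n_A", "Feature_3", "Feature_spike_n_B")
# keep only the entries containing the literal string "Feature_spike_n_"
spiked_labels <- grep("Feature_spike_n_", truth_col, fixed = TRUE, value = TRUE)
spiked_labels
```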
Best,
Siyuan