Hi! I’m trying to simulate microbial data with my own specified correlation matrix. I was thinking that I could simulate data using the “Stool” template, fit the data, overwrite the appropriate matrix, and then use the fitted data to re-simulate some data, now with my specified correlation matrix. My first question is, will this work or is there an easier way to do what I’m hoping to accomplish? If so, what’s the appropriate matrix I need to overwrite? From what I understand, the correlation between microbes is controlled by Omega. However, Omega isn’t bound between -1 and 1 so I’m guessing it’s a covariance matrix. My guess would be that Corr_star is the one I’m looking for. But what’s the difference between sigma, Sigma, Omega, and Corr_star?
Hi - thanks for the interests!
-
You shouldn’t need to fit the data again. Just access the
Stool
fit object, which is stored internally in SparseDOSSA. You can do this with e.g.Stool_mod <- SparseDOSSA2:::Stool
. -
You can then simply modify with your own correlation matrix. The one you’re looking for is
Sigma
(and correspondingly,Omega
).Sigma
is the correlation matrix andOmega
is its inverse (i.e. the precision matrix).sigma
is per-feature standard error.Corr_star
is something SparseDOSSA uses internally when estimatingSigma
andOmega
. So you’d do something like
a.Stool_mod$EM_fit$fit$Sigma <- My_own_corr
b.Stool_mod$EM_fit$fit$Omega <- solve(My_own_corr)
-
You can then simulate data with
SparseDOSSA2(template = Stool_mod, new_features = FALSE, ...)
. Thenew_features = FALSE
setting is important to ensure the new correlation matrix is propagated.
Thank you so much for the quick and clear response! That makes a lot of sense now.
As for #3, is there a way to specify the number of features without specifying new_features = TRUE? It seems the default is 332 if new_features = FALSE but I’d like 1,000 features instead.
This can be done but needs to call SparseDOSSA’s internal functions. To generate “new” features that “look like” the true 332 stool features, SparseDOSSA fits a three-dimensional density estimator on the pi0, mu, and sigma values of each feature. This fit is stored internally in the F_fit
component.
To use this for your case, I’d simulate 1,000 features, and use their pi0, mu, and sigma to replace those currently stored in Stool
, just as the replacement for Sigma
and Omega
above.
- First, simulate 1,000 new futures:
Stool <-
Stool_mod <-
SparseDOSSA2:::Stool
feature_param_template <- cbind("pi0" = Stool$EM_fit$fit$pi0,
"mu" = Stool$EM_fit$fit$mu,
"sigma" = Stool$EM_fit$fit$sigma)
feature_param <-
SparseDOSSA2:::generate_featureParam(
F_fit = Stool$F_fit,
param_original = feature_param_template,
n_feature = 1000)
- Use the simulated
pi0
,mu
,sigma
to replace those stored inStool_mod
:
Stool_mod$EM_fit$fit$pi0 <- feature_param[, "pi0"]
Stool_mod$EM_fit$fit$mu <- feature_param[, "mu"]
Stool_mod$EM_fit$fit$sigma <- feature_param[, "sigma"]
Stool_mod$EM_fit$fit$Sigma <- My_own_corr # note the dimensionality must agree here
Stool_mod$EM_fit$fit$Omega <- solve(My_own_corr)
Stool_mod
can then be used for simulation just as above. I’ll consider adding an exported function for this in the next update.
Thank you! I was able to get it running very easily. Much appreciated!
Hi,
I’m using the code you wrote to specify my own correlation matrix. However, I changed pi0 to be a vector of zeros so I could remove sparsity. Then I use:
stool ← SparseDOSSA2(template = Stool_mod, n_sample = 300, new_features = FALSE)
If I specify my own correlation matrix, the correlation matrix of the simulated data seems biased towards positive correlations. [cor(t(stool$simulated_data))] I cannot seem to get a negative correlation from the simulated data even if I specify negative correlations in Sigma. Do you know why this might be?
Also, when I look at the inverse of stool$params$Omega, I get something different than what I specify to Sigma. Do you know why this is?
Hi,
-
Only Omega is used internally. SparseDOSSA does not check if Sigma and Omega are inverse of each other. If you intend to provide your own Sigma, you’ll need to make sure that its inverse is updated as Omega as well. I committed an update to check that Sigma and Omega are inverses of each other. If they are not an error is reported.
-
Another thing: Sigma does not directly agree with the correlation of simulated_data. Rather, it is the correlation matrix for stool$simulated_matrices$a_null, which is then renormalized and sampled with multinomial sampling to generate simulated_data. To see that the code is working, you should first normalize the count values in simulated_data to relative abundances - it does not make sense to consider correlation while there are technical variabilities due to sequencing depth variation. Even there you shouldn’t expect to see exactly the same correlation (due to the renormalization from a_null explained above), but in our experience the two should agree reasonably well.
Thank you so much! This was very helpful!