Spurious and indirect correlations in pairwise HAllA?

Dear all

I have performed multiple pairwise HAllA runs between my taxonomic (e.g. kraken2), functional (humann3), metabolite, and environmental measurement datasets. After residualizing my datasets, I got a large number of significant correlations.

However, I can’t help but think that some of the correlations are spurious or indirect correlations.

For example, suppose my hypothesis is that exposure to the environmental variable results in an association with a microbial function, and that function in turn results in the synthesis of the metabolite. Then the significant association between the environmental variable and the metabolite is only indirect and should not be there. I notice that in my dataset, there are quite a number of correlations where the edges between the environmental variable, the function, and the metabolite form a triangle in a network representation plot.

Is there a way, after getting the correlation output, to identify such spurious correlations and remove them from the results? Thank you very much.

Kind regards

Marcus

Hi Marcus. HAllA is aimed at identifying associations between a pair of high-dimensional datasets. Looking at the network structure between more than two datasets and accounting for some known or unknown influence structure between them is unfortunately beyond HAllA’s scope.

Hi Andrew thanks for the response.

In this case, how did the authors of the manuscript below account for potential spurious or indirect correlations in their pairwise HAllA analyses?

Thanks :slight_smile:
Marcus

Lloyd-Price et al. 2019. Multi-omics of the gut microbial ecosystem in inflammatory bowel diseases
https://www.nature.com/articles/s41586-019-1237-9

I believe in that instance they accounted for additional covariates by first regressing them out using a mixed-effects model. HAllA was then run on the matrices of residuals instead of the matrices of raw data.

So in your example, you’d run a mixed-effects model to regress out the effect of the environmental variable that you think might cause confounding.
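To make the "regress out, then correlate the residuals" idea concrete, here is a minimal sketch on simulated data. It is only an illustration under several assumptions: it uses a plain least-squares fit on a single covariate in place of a mixed-effects model (so no random effects), the names `residualize` and `pearson` are hypothetical helpers, and the env -> function -> metabolite chain is simulated rather than real. It does not call HAllA; in practice the residuals would be written out and passed to HAllA as the input tables.

```python
import random


def mean(xs):
    return sum(xs) / len(xs)


def residualize(values, covariate):
    """Residuals of a simple least-squares fit values ~ covariate.

    Simplification: a real analysis would likely use a mixed-effects
    model with subject-level random effects and several covariates.
    """
    mx, my = mean(covariate), mean(values)
    sxx = sum((x - mx) ** 2 for x in covariate)
    sxy = sum((x - mx) * (y - my) for x, y in zip(covariate, values))
    slope = sxy / sxx
    intercept = my - slope * mx
    return [y - (intercept + slope * x) for x, y in zip(covariate, values)]


def pearson(a, b):
    """Pearson correlation coefficient of two equal-length lists."""
    ma, mb = mean(a), mean(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / (va * vb) ** 0.5


# Simulated chain (assumption): env drives func, func drives metab.
rng = random.Random(0)
n = 500
env = [rng.gauss(0, 1) for _ in range(n)]
func = [e + rng.gauss(0, 0.5) for e in env]
metab = [f + rng.gauss(0, 0.5) for f in func]

# Raw env-metabolite correlation: strong, although the link is indirect.
raw_r = pearson(env, metab)

# Regress the suspected intermediate (func) out of both variables and
# correlate the residuals; this is the partial correlation given func.
partial_r = pearson(residualize(env, func), residualize(metab, func))

print(f"raw r = {raw_r:.3f}, r after regressing out func = {partial_r:.3f}")
```

In this simulation the env-metabolite correlation largely disappears once the function is regressed out of both variables, which is the pattern you would expect for an indirect edge in one of those triangles. An edge that stays strong after residualizing is not explained by that variable alone, although this says nothing about unmeasured confounders.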

Given how dataset-specific this modelling would need to be, HAllA doesn’t include the functionality to run the model and extract the residuals. You can see more discussion about this type of process in this thread: