Statistical Approach for Comparing Functional Potential of Selected SGBs vs. Community Background

I have a set of metagenomic samples that have been processed using BioBakery tools.

Let’s say I’ve identified several SGBs of interest. I’d like to compare their genetic potential (pathways/ECs) to that of the rest of the microbial community (i.e., all other SGBs). What would be an appropriate statistical approach to test this?

I assume it’s important to account for differences in the SGBs’ relative abundances: if the SGBs of interest are less abundant overall, I would naturally expect them to contribute less to a given functional feature.

To complicate things a bit more, the groups are unbalanced: I’m comparing ~40 SGBs of interest against >1000 other SGBs. What would be a good way to address both the abundance weighting and the imbalance in group size?

Thanks!

To answer “is the genetic potential of these groups different?”, Anvi’o offers a statistical test for comparing the prevalence of functional modules (sometimes a gene, sometimes a pathway) between two groups (the anvi-compute-functional-enrichment program). Imbalance in group size is automatically accounted for, and abundance (read mapping) does not enter into the test (though you can get that information along the way).

The most basic version of what you’re describing is a 2x2 contingency table: 1) bug of interest yes/no vs. 2) bug with function X yes/no. You can build that table for each function and then evaluate the statistical significance of the overlap with a Fisher’s Exact Test (for example).
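
A minimal sketch of that per-function test in R, in case it helps. Everything here is made up for illustration: the presence/absence matrix is simulated, and the names `of_interest`, `has_function`, and the `EC_a`/`EC_b`/`EC_c` labels are hypothetical placeholders for your own SGB-by-function table. Since you would be running one test per function, the sketch also applies a Benjamini-Hochberg correction, which is my own addition rather than something required by the test itself.

```r
# Toy data (all values simulated): rows = SGBs, columns = functions.
set.seed(1)
n_sgb <- 200
of_interest <- c(rep(TRUE, 20), rep(FALSE, 180))  # hypothetical group labels

has_function <- matrix(
  rbinom(n_sgb * 3, 1, 0.3) == 1,
  nrow = n_sgb,
  dimnames = list(NULL, c("EC_a", "EC_b", "EC_c"))
)
# Artificially enrich EC_a in the SGBs of interest so there is something to find.
has_function[which(of_interest)[1:18], "EC_a"] <- TRUE

# Build the 2x2 table for one function and run Fisher's exact test on it.
test_one <- function(f) {
  tab <- table(
    interest = factor(of_interest, levels = c(FALSE, TRUE)),
    has_fun  = factor(has_function[, f], levels = c(FALSE, TRUE))
  )
  ft <- fisher.test(tab)
  c(odds_ratio = unname(ft$estimate), p_value = ft$p.value)
}

# One row per function, then adjust p-values for multiple testing.
res <- as.data.frame(t(sapply(colnames(has_function), test_one)))
res$q_value <- p.adjust(res$p_value, method = "BH")
print(res)
```

Because the test conditions on the table margins, having ~40 SGBs in one group and >1000 in the other is not a problem in itself; it mainly limits power for functions that are rare in the smaller group.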

You could also model a scenario like that as logistic regression, which I believe is what the approach referenced above is doing under the hood. An advantage of logistic regression is that you can build in additional covariates of interest, e.g. sample (or mean) abundance. Something like…

glm(A ~ B + C, family = binomial(link = "logit"), data = data)

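Here A = “bug has function X” (binary), B = “bug is of interest” (binary), and C = “bug abundance.” One caveat: logistic regression can sometimes be flaky if the data aren’t nicely structured, for example when a function is present in all (or none) of the bugs of interest, in which case the coefficient estimates can become unstable.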
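
To make that concrete, here is a small runnable sketch on simulated data. The data frame `sgb`, its column names, the group sizes (40 vs. 1000, borrowed from the question), and the coefficients used to simulate the outcome are all assumptions for illustration; `log_abundance` stands in for whatever abundance summary you choose (e.g. a log-transformed mean relative abundance).

```r
# Simulated SGB table (hypothetical values throughout).
set.seed(42)
n <- 1040
sgb <- data.frame(
  of_interest   = c(rep(1, 40), rep(0, 1000)),  # B: bug is of interest
  log_abundance = rnorm(n)                      # C: abundance covariate
)

# Simulate A ("bug has function X") with made-up effects of group and abundance.
eta <- -1 + 1.5 * sgb$of_interest + 0.5 * sgb$log_abundance
sgb$has_function <- rbinom(n, 1, plogis(eta))

# Logistic regression: function presence ~ group membership + abundance.
fit <- glm(has_function ~ of_interest + log_abundance,
           family = binomial(link = "logit"), data = sgb)

coefs <- summary(fit)$coefficients
print(coefs)

# Odds ratio for being an SGB of interest, adjusted for abundance.
print(exp(coef(fit)["of_interest"]))
```

In practice you would fit one such model per function and correct the resulting p-values for multiple testing, as with the contingency-table approach.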

That’s very helpful. Thank you both!