Finding out which genes drive varation seen between conditions

Hello @leonard.dubois
I have WMS-reads from a clinical setting that involves healthy and diseased persons, whereby diseased is differentiated in different stages: healthy, dis_stage1, dis_stage2.

I profiled strains of S. aureus (and other species) and performed PCA. I observe differences between strains in the healthy and diseased condition. Now, I would like to find out which genes are responsible for this variation. Do you have any advice for doing so?

I already tried to fit a glm of each gene against the condition-variable (healthy, dis_stage1, dis_stage2).

> glm(gene.matrix ~ grouping.variable, family = binomial(link = "logit"))

but with that I only get 2 results with a p.value > 0.05, which I find a bit odd in a matrix of 3500 genes. Do you have another idea how to extract the variables that drive variation?



well, that is a tough question but a really fascinating one. I did try to tackle it but haven’t come out with a perfect solution yet.

First, any kind of test done on the whole set of gene will be highly impacted by multiple test correction. One first simple way of reducing that would be to cut the top and bottom prevalent gene in the matrix (you might have already done that). For example not considering all the genes present in 100% to 95 or 90% of your samples as they would be core or soft/extended core genome. And the other way around, depending of your number of samples, maybe filtering out the genes present in only 1% (or more) of the samples.

Have you tried doing some Fisher or chi-squared tests? For example building a contingency table with:
healthy - nb of sample with gene present - nb of sample with gene absent
disease - nb of sample with gene present - nb of sample with gene absent
This will actually look for the significant changes in prevalence across conditions.

Also sometimes I find it better to consider the odd ratio rather than the p-value…


Hello @leonard.dubois

thanks a lot for your comprehensive answer. I tried what you suggested and builded a contingency table and running a chi-square test. But how can I no conclude which genes are differentially present between the conditions?
Did I build the table right?
For the chi-square test I called:

A tibble: 1 × 6
n statistic p df method p.signif
int dbl dbl int chr chr
1 1200 857. 1.83e-11 599 Chi-square test ****

Another method that came to my mind is to use DAVID Ontology and run the samples against it. Do you think that is reasonable?


Ontology can be relevant to look for enriched functions indeed, and might be easier to use than raw UniRef90 annotation. I personally use KEGG Orthology ID (KO) or COG IDs from the pangenome annotation file that is provided in PanPhlAn pangenomes.

For the test I usually have a contingency matrix with 4 cells (same template as in my previous message). So each gene is tested separately but then multiple correction can be harsh…