Question about reference parameter in maaslin3

Hi,
I would like to know if I need to set the reference parameter in maaslin3. If I want to find differentially enriched bacteria associated with a disease sample (control vs disease), and at the same time I need to adjust for some other confounders (e.g. age, bmi, sex). Do I still need to set reference = c(‘type,disease’) in maaslin3? Also, I’m wondering if I want to identify differentially enriched bacteria associated with a disease sample (control vs disease) while controlling for confounders, do I just filter out the rows under the metadata column that contain a disease counterpart?
At the same time, I have some questions about the interpretation of the summary plot. After generating q_joint and p_joint values in the main file, why are there still separate abundance p(fdr), q(fdr) and prevalence p(fdr), q(fdr) in the plot notes? And where does covariates P(fdr) come from, since there is no corresponding column in the file all_result.tsv. And finally I would like to ask if the results derived from the coefficient plots and heatmaps are all relative to the results derived from them versus the control group?
Sorry for all the questions, looking forward to your answers!

Hi,

The reference should be set either by specifying reference = c('type,disease') or by setting type to be a factor variable beforehand; both do the same thing. If you set the reference as disease, each non-disease group within the type variable will be compared against samples with disease, and your coefficients will tell you how much more abundant or prevalent each species is relative to in the disease group. Assuming you actually have control and disease in type, I would tend to set control as the reference rather than disease so that you can interpret the associations as species more/less prevalent/abundant with disease.

Assuming you’re talking about the results table, to identify differentially enriched bacteria (disease vs. control groups) while controlling for confounders, you can just look at the rows of the results table with disease (or maybe type depending on how you’ve set it up) in the metadata column.

The joint p/q-values answer the question “is this association significant for either abundance or prevalence?” while the individual p/q-values answer the questions “is this association significant for abundance?” and “is this association significant for prevalence?” In general, you might care about these separately and see different prevalence vs. abundance effects, so the plot is designed to highlight that phenomenon if it occurs. The full results give you both the individual and the joint p/q-values in case you’re interested in the joint effect (though practically this is mostly used for benchmarking since all other tools just give you a single p-value).

The covariate p/q-values are also in the results table in the rows corresponding to those covariates. Are you not seeing them there? If they’re in your plot, they must be in the table (notwithstanding some sort of bug).

The coefficients and p/q-values (which are then displayed in the plots) come from whatever comparison you’ve specified in your linear model. If you’ve included the covariate type with the levels control and disease, the corresponding coefficients will be the difference between disease and control samples on average.

Will

Hi,
Thank you! I really appreciate your patient response. However, I’m still confused about the several p/q values.

Take the all_result.tsv file as an example. If pval_individual and qval_individual are significant, does that mean the association is significant for both abundance and prevalence, while a significant joint_p/q value only indicates that the association is significant for either abundance or prevalence (i.e., at least one of them)?

In the summary plot, there are corresponding P (FDR) values for both abundance and prevalence, but I still don’t know which ones in all_result.tsv they correspond to (pval_individual, qval_individual, qval_joint, pval_joint). At the same time, I still can’t find the corresponding covariates P (FDR) values in the table. Could you please help clarify these issues? Thank you so much!


The issue might be that when read in Excel (I think, from the picture), your column names are all shifted 1 column to the left for some reason. I’d check that to start. If you read them in R, they’ll be fine.

If the pval_individual or qval_individual is significant, that means that the prevalence or abundance association (whichever of the two is in the particular row, see the model column that says prevalence or abundance) is significant. All 4 combinations of abundance/prevalence significant/insignificant are possible. The joint p/q-value only indicates the association is significant for at least one of them. (For technical mathematical reasons, it’s possible that your joint p/q-value is significant without either individual p/q-value being significant or that one of the individual p/q-values is significant without the joint being significant. But almost all the time, they’ll be logically consistent.)

The p/q-value in the plot will correspond to a row for that particular feature, metadatum, and model (prevalence vs. abundance in the model column). The P(FDR) (which is the q-value) is in the qval_individual column (once you shift the column names).

Will

Thank you for your answer, I really appreciate your patiently response. And my final question is about the covariates P fdr, I still can not see the covariates P fdr in the result (all_result.tsv), what does that mean?
Thank you again for your time and support.

Can you clarify what you’re looking for? The P fdr is same thing as the q-value, which is in the table as qval_individual and qval_joint for each row. The covariates each get their own row.