All associations had errors or were insignificant

Hello, I’m running a maaslin3 (version 1.0.0) analysis and the output says “All associations had errors or were insignificant”. My result file, code, and data are as follows:
all_results (1).tsv (4.1 MB)
library(maaslin3)
packageVersion("maaslin3")

# Metadata table
ph_TAN <- read.csv("/data/work/QTP/maaslin3/TAN/Maaslin3_qtp_TA_255sample.csv", header = TRUE, row.names = 1, check.names = FALSE)

# SGB relative abundances
SGB_TAN <- read.csv("/data/work/QTP/maaslin3/TAN/qtp_TA_255sample_relabu.csv", header = TRUE, row.names = 1, check.names = FALSE)

SGB_TAN <- SGB_TAN / 100
output_dir <- "/data/work/QTP/maaslin3/TAN/elevation_group/SGB"
input_TAN_data <- as.data.frame(SGB_TAN)
input_TAN_metadata <- as.data.frame(ph_TAN)
fit_out <- maaslin3(input_data = input_TAN_data,
                    input_metadata = input_TAN_metadata,
                    output = output_dir,
                    fixed_effects = c("elevation_group"),
                    reference = c("elevation_group,4-4.5KM"),
                    random_effects = c("Latitude", "Longitude", "sampled_place",
                                       "KP994558.1_s_Potentilla_acaulissta",
                                       "KJ020646.1_s_Potentilla_parvifoliasta",
                                       "MH854502.1_s_Fagopyrum_dibotryssta",
                                       "HE577530.1_s_Chenopodium_hybridumsta",
                                       "FJ640034.1_s_Leontopodium_pusillumsta",
                                       "AB480625.1_s_Orostachys_fimbriatasta",
                                       "KU750607.1_s_Senecio_scandenssta"),
                    normalization = "TSS",
                    transform = "LOG",
                    warn_prevalence = TRUE,
                    augment = TRUE,
                    standardize = TRUE,
                    max_significance = 0.1,
                    median_comparison_abundance = TRUE,
                    median_comparison_prevalence = FALSE,
                    max_pngs = 250,
                    cores = 10)
qtp_TA_255sample_relabu.csv (9.8 MB)
Maaslin3_qtp_TA_255sample.csv (29.7 KB)
Can you tell me what is the reason for this problem?
Best wishes

Hi,

I’m pretty sure the issue is in how you’re using the random effects. I would recommend using at most one random effect. Looking at your data, I would recommend using elevation_group as a fixed effect and sampled_place as a random or fixed effect. It looks like sampled_place perfectly determines all the other variables you have in your random effects, so if you just want to control for those but not analyze anything about them in particular, using sampled_place is sufficient. In fact, those other variables are so highly collinear that I don’t think a model will even fit with both those variables and sampled_place included.
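One way to check whether a grouping column determines the other variables is to count distinct values per group. This is only a sketch in base R on made-up toy metadata: the helper `is_determined_by` and the columns `diet_item` and `host_sex` are hypothetical, not part of maaslin3 or of the posted data.

```r
# Toy metadata standing in for the real table (all values are made up).
meta <- data.frame(
  sampled_place = c("A", "A", "B", "B", "C", "C"),
  Latitude      = c(30.1, 30.1, 31.5, 31.5, 29.8, 29.8),
  diet_item     = c("YES", "YES", "NO", "NO", "YES", "YES"),
  host_sex      = c("F", "F", "M", "M", "F", "M"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: a variable is determined by the grouping column
# when every group contains exactly one distinct value of that variable.
is_determined_by <- function(variable, data, group) {
  values_per_group <- tapply(data[[variable]], data[[group]],
                             function(x) length(unique(x)))
  all(values_per_group == 1)
}

candidates <- c("Latitude", "diet_item", "host_sex")
determined <- vapply(candidates, is_determined_by, logical(1),
                     data = meta, group = "sampled_place")
print(determined)
```

Variables that come back TRUE carry no information beyond the grouping column, so adding them as extra random effects next to it is redundant.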

Will

Thank you for your answer. In addition to sampled_place as a random effect, I also want to include feeding habits (what these hosts eat) as random effects. For example, “YES” in KP994558.1_s_Potentilla_acaulissta means that the host ate this plant and “NO” means it did not. How do I add these random effects to the analysis?
Best wishes

If you want to include feeding habit, I would put it in fixed_effects, since I suspect you’ll run into fitting issues with multiple random_effects, especially when sampled_place completely explains the feeding habit (from what it looked like in the data). If you do include the feeding habits, I would also fit each one in a separate model, since it looks like all the feeding habits together will be perfectly collinear with your main effect, which would leave one or the other unestimated.
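As a rough sketch of the one-model-per-feeding-habit idea, the loop below builds a separate argument set for each diet column and defines a wrapper around maaslin3. The diet column names come from the original call; `model_specs`, `run_one`, and the `maaslin3_out` output folder are hypothetical names chosen for illustration, and keeping sampled_place as the random effect is an assumption.

```r
# Two of the diet columns from the original call, used here as examples.
diet_vars <- c("KP994558.1_s_Potentilla_acaulissta",
               "KJ020646.1_s_Potentilla_parvifoliasta")

# One specification per diet variable: elevation_group plus that single habit.
model_specs <- lapply(diet_vars, function(v) {
  list(fixed_effects = c("elevation_group", v),
       output = file.path("maaslin3_out", v))  # hypothetical output folder
})
names(model_specs) <- diet_vars

# Hypothetical wrapper; it only calls maaslin3 when it is invoked.
run_one <- function(spec, data, metadata) {
  maaslin3::maaslin3(input_data = data,
                     input_metadata = metadata,
                     output = spec$output,
                     fixed_effects = spec$fixed_effects,
                     reference = c("elevation_group,4-4.5KM"),
                     random_effects = c("sampled_place"),
                     normalization = "TSS",
                     transform = "LOG")
}

# With the real tables loaded, each model could then be fit with:
# fits <- lapply(model_specs, run_one,
#                data = input_TAN_data, metadata = input_TAN_metadata)
```

Each run writes to its own folder, so results for different feeding habits do not overwrite each other.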

Will

Thanks, Will. sampled_place is not my focus; my focus is to introduce feeding habits as a random effect. If I merge the feeding-habit variables into a single diet_group variable (as shown in the figure below) and use that one variable as the random effect for feeding habit, do you think that’s okay?

To clarify, whatever you use as a random effect (whether it’s diet_group or sampled_place) will not show up in your results table, because random effects only control for grouping. Based on your data, I think each sampled_place had a unique combination of diet items, so making one categorical variable out of the diet combinations will likely give you exactly the same result as including only sampled_place.
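To see whether a merged diet variable is just a relabeling of the sites, a cross-tabulation is enough. The sketch below uses base R on made-up data; `plant_1` and `plant_2` are hypothetical stand-ins for the real diet columns.

```r
# Toy metadata (all values are made up).
meta <- data.frame(
  sampled_place = c("A", "A", "B", "B", "C", "C"),
  plant_1       = c("YES", "YES", "NO", "NO", "YES", "YES"),
  plant_2       = c("NO", "NO", "NO", "NO", "YES", "YES"),
  stringsAsFactors = FALSE
)

# Merge the diet columns into one categorical variable.
meta$diet_group <- paste(meta$plant_1, meta$plant_2, sep = "_")

# Cross-tabulate sites against diet combinations.
cross_tab <- table(meta$sampled_place, meta$diet_group)
print(cross_tab)

# One-to-one: each site has exactly one diet group and each diet group
# occurs at exactly one site.
one_to_one <- all(rowSums(cross_tab > 0) == 1) &&
  all(colSums(cross_tab > 0) == 1)
print(one_to_one)
```

If `one_to_one` is TRUE, diet_group and sampled_place define the same grouping, so either one gives the same random-effect structure.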

Hello, Will. I took your suggestion and used sampled_place as a random effect. I still have some doubts about the choice of parameters. My data are relative abundances in percent (each species ranges from 0 to 100, and each sample sums to 100). The parameters of my two tests are as follows:

Test 1:
output_dir <- "/data/work/QTP/maaslin3/TAN/elevation_group/test1"
fit_out <- maaslin3(input_data = input_TAN_data,
                    input_metadata = input_TAN_metadata,
                    output = output_dir,
                    fixed_effects = c("elevation_group"),
                    reference = c("elevation_group,4-4.5KM"),
                    random_effects = c("sampled_place"),
                    normalization = "NONE",
                    transform = "NONE",
                    warn_prevalence = FALSE,
                    augment = TRUE,
                    standardize = FALSE,
                    max_significance = 0.1,
                    median_comparison_abundance = TRUE,
                    median_comparison_prevalence = FALSE,
                    max_pngs = 250,
                    cores = 5)

Test 2:
output_dir <- "/data/work/QTP/maaslin3/TAN/elevation_group/test2"
fit_out <- maaslin3(input_data = input_TAN_data,
                    input_metadata = input_TAN_metadata,
                    output = output_dir,
                    fixed_effects = c("elevation_group"),
                    reference = c("elevation_group,4-4.5KM"),
                    random_effects = c("sampled_place"),
                    normalization = "TSS",
                    transform = "LOG",
                    warn_prevalence = TRUE,
                    augment = TRUE,
                    standardize = TRUE,
                    max_significance = 0.1,
                    median_comparison_abundance = TRUE,
                    median_comparison_prevalence = FALSE,
                    max_pngs = 250,
                    cores = 5)

In test 2, significant_results contains more abundance-associated features than in test 1. Which parameters should I choose?

I would use test 2. TSS shouldn’t change anything if your data are already percents (it converts them to proportions, which is just scaling by a constant). The log transform is likely driving the difference, and I would always log transform, since the linear models fit the data much better on the log scale (and you also get beneficial properties with the median comparison, etc.).
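The scaling point can be illustrated with a small base R sketch on made-up percentages. It mimics total-sum scaling by hand rather than calling maaslin3, and it sets zeros aside before taking logs, which is a simplification made for this example.

```r
# Toy percent table: samples as rows, each row sums to 100 (made-up values).
pct <- matrix(c(50, 30, 20,
                10,  0, 90),
              nrow = 2, byrow = TRUE,
              dimnames = list(c("s1", "s2"), c("sp_a", "sp_b", "sp_c")))

# Total-sum scaling by hand: divide every row by its own total.
tss <- pct / rowSums(pct)

# Because every row sums to 100, this equals dividing the table by 100.
print(isTRUE(all.equal(tss, pct / 100)))

# On the log scale the two versions differ only by a constant offset,
# so differences between samples are unchanged.
nonzero <- pct > 0
shift <- log2(pct[nonzero]) - log2(tss[nonzero])
print(isTRUE(all.equal(shift, rep(log2(100), sum(nonzero)))))
```

Since the normalization only rescales the data here, the NONE versus LOG choice is the setting that actually differs between the two tests.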

Will

Hi, I checked my data again and test 1 found more associated species. My species abundance data contain a lot of zeros. In this case, do you still recommend using test 2?

Hi, I didn’t follow what you were saying in that message, but I would essentially always recommend test 2 (TSS and LOG) over test 1 (NONE and NONE).

Thanks, Will. I took your suggestion to use TSS and LOG.
I now have a new problem. The previous analysis was based on species relative abundance, where each sample summed to 100 (%). I now have KO gene abundance data (in TPM), where the per-sample sums are not consistent and each KO ranges from 0 to 1000. Should TSS and LOG be used for this analysis?

Are these the outputs of HUMAnN? Anything that’s per-gene and per-read normalized can be converted into relative abundance.
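For illustration, converting a per-sample table with unequal totals into relative abundance is a row-wise division. The sketch below is base R on made-up numbers, assuming samples are rows; the KO labels and values are placeholders, not real output.

```r
# Toy KO table with unequal per-sample totals (made-up values).
ko_tpm <- matrix(c(120,  0, 880,
                    40, 10, 500),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(c("s1", "s2"),
                                 c("K00001", "K00002", "K00003")))
print(rowSums(ko_tpm))  # totals differ between samples

# Divide each row by its total so every sample sums to 1.
ko_relab <- sweep(ko_tpm, 1, rowSums(ko_tpm), "/")
print(rowSums(ko_relab))
```

This is the same operation that normalization = "TSS" is described as performing earlier in the thread, so the table could also be passed in as-is with that setting.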

Will