The bioBakery help forum

Setting reference levels with 2 factors

Hi !

I am discovering the MaAsLin2 package.

I am doing a comparison of the metagenome of healthy volunteers (named CT) versus patients (named AML) using R and the latest release version of Maaslin2 (updated this morning). I would like to have the plots with the CT placed in the first column and the AML placed in the second column.

I followed the instructions found here.
https://huttenhower.sph.harvard.edu/maaslin/
Setting reference levels

But despite having the factors set at the right level, the output is still not correct.

Here are my R lines

mapping_T0$Group_ordered = factor(mapping_T0$Group, levels=c(“CT”,“AML”))

fit_data = Maaslin2(input_data = data.filter2, input_metadata = mapping_T0, output = “Maaslin2_with_filters”, fixed_effects = c(“Group_ordered”))

str(mapping_T0$Group_ordered)
Factor w/ 2 levels “CT”,“AML”: 2 2 2 2 2 2 2 2 2 2 …

I saw the reference option here GitHub - biobakery/Maaslin2: MaAsLin2: Microbiome Multivariate Association with Linear Models and here Reference values for a categorical fixed effect variable but it seems to work only with factors of more than 2 levels.

I looked at similar posts, and didn’t find a solution. I hope I didn’t miss it.

Thanks for your input !
Looking forward to learning more about the package

Hi @laure_bindels,

You are correct on the behavior of MaAsLin a variable with only two reference levels will not allow for the reference to be set and will ignore the set ordered factor and instead use an alphabetical application. We have done this since the numbers will not change if you swap the reference level in this case, only the sign will change. To test this you could do something like add “a_” in front of your CT group, which should force that group to be the reference level.

Sorry if this is confusing!! I will make a note for a future version of the package to allow references even with 2 level factors.

Best,
Kelsey

Dear Kelsey,

Thanks for your swift reply. I was interested to have the graphes ready to use for presentations and publications as I saw in the documentation that data points are plotted after normalization, filtering, and transform, and I didn’t know how to retrieve these values. I will proceed as adviced and edit labels later on with an overlay.

Many thanks again for your answer,
Best,

Laure

Dear Kelsey,

I saw in the documentation that data points are plotted after normalization, filtering, and transform. Is there a way to retrieve these values ? powerpoint editing will not work to assign the colors and other characterictics of the other graphs I generated in R for this cohort. :smiley:

Many thanks for your help !
Best,

Laure

Hi Laure - I just checked to confirm and the scatter/box plots are made on the raw (filtered data), e.g. those plots are made before MaAsLin does normalization and transformation. Thus, if you want to replicate you can simply take the row/column from your feature table and replicate the plot in ggplot2 geom_point or ggplot2 geom_boxplot.

I hope this helps!

Best,
Kelsey

Hi Kesley,

Thanks for your reply and checking this. I ran Maaslin2 with lm and cplm (all parameters are the same, except transform = LOG and NONE respectively) and the plots are not identical.

Any idea of what may go wrong ?

Thanks for your help !

Laure

Dear Kelsey,

My apologies for bothering you again. Did you had a chance to give a look to my last post ? Can you confirm that the data are plotted with the raw values? I am doubtful of this as different models resulted in different plots for the same metagenomic pathway. I must be missing something. This is important for us as we need to replicate the graphs using the classical R tools you mentionned.

Many thanks for your help and understanding,
Best wishes,

Hi @laure_bindels - it has been difficult to reproduce this without a code - would you be able to paste your MaAsLin 2 runs here so that we can debug? Based on the quick look at the source code, it looks like the figures should be identical across model runs (as @Kelsey_Thompson already mentioned) - so I am curious if there are additional parameters that are causing the differences in the output.

Thanks,
Himel

1 Like

Hi Himel,

Thanks for the reply.

Here are my lines of code.
I have rerun the functions, with the names reorganized.
I put also a screenshot of the same function in each model.

fit_data = Maaslin2(input_data = data.filter2, input_metadata = mapping_T0, output = “metacyc_maaslin2_model1”,
min_abundance = 1,
min_prevalence = 0,
max_significance = 0.1,
normalization = “NONE”, # data have been normalized in Humann3
transform = “LOG”,
analysis_method = “LM”,
# random_effects = c(“PatientID”),
fixed_effects = c(“newname”),
correction = “BH”
)

fit_data = Maaslin2(input_data = data.filter2, input_metadata = mapping_T0, output = “metacyc_maaslin2_model3”,
min_abundance = 1,
min_prevalence = 0,
max_significance = 0.1,
normalization = “NONE”,
transform = “NONE”,
analysis_method = “CPLM”,
# random_effects = c(“PatientID”),
fixed_effects = c(“newname”),
correction = “BH”
)

I also plotted the same values in R using the raw data.
image

Based on that, I concluded that model 3 outputs the raw data but not model 1.

So I also log-transformed the raw data and plotted them.

image

It seems like the log data are the ones plotted for model 1. Should I conclude that the data are plot after transformation ?

This discussion helped me figuring out how to replicate the Maaslin2 data in R, which was my first objective. I am happy to create a succint R environment with the complete files and share them privately, if it helps.

Many thanks for your time !

Laure

Hi @laure_bindels - thanks for the detailed exploration. On our end, we could not replicate this error as the source code appears to be plotting the raw values which should remain the same for all model combinations. Could you please make sure you are using the most updated version of the software either from Bioconductor (released version) or Github (development version)?

Many thanks,
Himel

Hi Himel,

Thanks for getting back to me
Sorry for the delay, I was out of office last week.

I had the version 1.0.0, and I was in R 3.6.3.
I was using the lines for installation from Bioconductor, but have just noticed that I was supposed to use R 4.1.

I reinstalled R studio (2021.09.1-372), R (4.1.2) and Maaslin2 (1.8.0).
I then ran the same analyses as above (model 1 and model 3).
And the plots are now the same.

Thanks for your patience and your advices, my apologies that the confusion came from such a basic issue.

Best wishes,

Laure