"Please provide the reference for the variable" error when running Maaslin2

Hello!

I am trying to run Maaslin2 with the code:

input_data = read.table(file = "4Masslin2_input.data_kos.taxonomy.archaea.mt.2group.tsv",
                        header = TRUE, sep = "\t")
rownames(input_data) <- input_data$Geneid_ord
input_data$Geneid_ord = NULL

metadata = read.table(file = "4Masslin2_metadata_kos.taxonomy.archaea.mt.2group.tsv",
                      header = TRUE, sep = "\t")
rownames(metadata) <- metadata$Geneid_ord
metadata$Geneid_ord = NULL

# Create the 'Ctrl' column
metadata$Ctrl <- ifelse(metadata$Diagnosis == "Ctrl", "Yes", "No")

# Create the 'PD' column
metadata$PD <- ifelse(metadata$Diagnosis == "PD", "Yes", "No")

# Create the 'iRBD' column
metadata$iRBD <- ifelse(metadata$Diagnosis == "iRBD", "Yes", "No")

reference <- unique(metadata$S)
reference <- c("Methanobrevibacter_A smithii","Methanobrevibacter_A smithii_A","Methanosphaera stadtmanae","Methanomethylophilus alvus","DTU008 sp001421185","Methanomassiliicoccus luminyensis","MX-02 sp006954405","Coprobacillus cateniformis","Methanobrevibacter_C arboriphilus_A","Methanosphaera cuniculi")

Maaslin2(input_data = input_data,
         input_metadata = metadata,
         fixed_effects = c("Ctrl", "PD", "iRBD", "S"),
         reference = reference,
         min_prevalence = 0,
         output = "test",
         transform = "LOG",
         plot_heatmap = TRUE,
         plot_scatter = TRUE,
         heatmap_first_n = 50,
         max_significance = 1)

Examples of my metadata and input data are below:

metadata:

         Diagnosis       D                 P               C                       O                       F                    G
K00053_1      Ctrl Archaea Methanobacteriota Methanobacteria      Methanobacteriales     Methanobacteriaceae Methanobrevibacter_A
K00053_2      Ctrl Archaea Methanobacteriota Methanobacteria      Methanobacteriales     Methanobacteriaceae Methanobrevibacter_A
K00053_3      Ctrl Archaea Methanobacteriota Methanobacteria      Methanobacteriales     Methanobacteriaceae       Methanosphaera
K00053_4      Ctrl Archaea  Thermoplasmatota  Thermoplasmata Methanomassiliicoccales Methanomethylophilaceae Methanomethylophilus
K00053_5        PD Archaea Methanobacteriota Methanobacteria      Methanobacteriales     Methanobacteriaceae Methanobrevibacter_A
K00053_6        PD Archaea Methanobacteriota Methanobacteria      Methanobacteriales     Methanobacteriaceae Methanobrevibacter_A
                                      S Ctrl  PD iRBD
K00053_1   Methanobrevibacter_A smithii  Yes  No   No
K00053_2 Methanobrevibacter_A smithii_A  Yes  No   No
K00053_3      Methanosphaera stadtmanae  Yes  No   No
K00053_4     Methanomethylophilus alvus  Yes  No   No
K00053_5   Methanobrevibacter_A smithii   No Yes   No
K00053_6 Methanobrevibacter_A smithii_A   No Yes   No

input_data:

                tpm
K00053_1 166.502489
K00053_2 188.409788
K00053_3  69.970092
K00053_4   2.219452
K00053_5 642.522944
K00053_6 136.308126

As a result I receive an error:

2023-05-11 17:25:04 INFO::Writing function arguments to log file
2023-05-11 17:25:04 INFO::Verifying options selected are valid
2023-05-11 17:25:04 INFO::Determining format of input files
2023-05-11 17:25:04 INFO::Input format is data samples as rows and metadata samples as rows
2023-05-11 17:25:04 INFO::Formula for fixed effects: expr ~  Ctrl + PD + iRBD + S
Error in Maaslin2(input_data = input_data, input_metadata = metadata,  : 
  Please provide the reference for the variable 'S' which includes more than 2 levels: Methanobrevibacter_A smithii, Methanobrevibacter_A smithii_A, Methanosphaera stadtmanae, Methanomethylophilus alvus, Methanomassiliicoccus_A intestinalis, UBA71 sp905187815, DTU008 sp001421185, Methanomassiliicoccus luminyensis, MX-02 sp006954405, Coprobacillus cateniformis, Methanobrevibacter_C arboriphilus_A, Methanosphaera cuniculi, Methanobrevibacter ruminantium_A.

Could you please suggest a solution to this error, and possibly explain what causes it?

Hi there,

It seems the variable S has more than two levels, so you need to pick a reference level for it, i.e. the category that every other level is compared against during model construction. The reference argument should be a string of the form "variable,level" rather than a vector of level names:

c("S,Methanobrevibacter_A smithii")

This would indicate that Methanobrevibacter_A smithii would be the reference level/category used for the variable S.
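
As a rough sketch of how that could look in your script: the snippet below builds the reference string and checks that the chosen level really occurs in S. The names ref_variable, ref_level and metadata_example are just placeholders I made up for illustration, and metadata_example is a tiny stand-in copied from the rows you posted, not your real table.

```r
# Hypothetical helper names; only the "variable,level" string format matters.
ref_variable <- "S"
ref_level <- "Methanobrevibacter_A smithii"

# Build the reference string in the "variable,level" form.
reference <- paste(ref_variable, ref_level, sep = ",")

# Stand-in for the S column of the metadata (values taken from the posted example).
metadata_example <- data.frame(
  S = c("Methanobrevibacter_A smithii",
        "Methanobrevibacter_A smithii_A",
        "Methanosphaera stadtmanae",
        "Methanomethylophilus alvus"),
  stringsAsFactors = FALSE
)

# Sanity check: the chosen reference level should exist in the variable.
stopifnot(ref_level %in% metadata_example[[ref_variable]])

print(reference)

# The original call would then stay the same apart from the reference argument
# (not run here, since it needs Maaslin2 and the real input files):
# Maaslin2(input_data = input_data,
#          input_metadata = metadata,
#          fixed_effects = c("Ctrl", "PD", "iRBD", "S"),
#          reference = reference,
#          output = "test")
```

Any level of S can serve as the reference; picking the most common one is a reasonable default, since all other levels are reported relative to it.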

Hope that helps!

Cheers,
Jacob