Batch correction for microbiome data

Dear Siyuan Ma,

I hope this message finds you well. I am a PhD researcher working with shotgun metagenomic sequencing data from a microbiome study, and I have been using your MMUPHin package to adjust for batch effects. I truly appreciate the thought and effort you have put into building a tool that’s tailored to the unique challenges of microbiome data.

I wanted to reach out to describe my dataset and analysis pipeline in detail, and ask for your advice on whether I am using MMUPHin appropriately, especially given the specific structure and confounding in my data.

Background of the Study:
We conducted a longitudinal study to investigate microbiome differences across seasons in a meat processing environment. We collected a total of 684 samples, with 337 from the summer phase and 347 from the winter phase.

However, due to the design and logistics of the study, we ended up using two different DNA extraction methods:

TCM method was used for most of the summer samples (286 samples).

Molysis Complete5 method was used for all of the winter samples (347 samples) and a small subset of the summer samples (51 samples).

As a result, season (our main biological variable of interest) and DNA extraction method (a technical batch effect) are strongly confounded. This has made batch correction especially challenging, since we want to preserve the biological signal of seasonal variation while removing any technical noise from the extraction method.

Input Data:
We analyzed shotgun metagenomic data using MetaPhlAn output to obtain relative abundances of microbial taxa. The resulting data matrix (abundance_df2) has:

Samples as rows, and 3930 taxa (species level) as columns. We also created a metadata dataframe (metadata_df2) with:

Phase as the variable indicating season (Summer/Winter), and DNA_Extraction_Method as the batch variable.

MMUPHin Batch Correction:
We transposed the abundance matrix so that taxa were in rows and samples in columns, and then converted relative abundances from percentages to proportions (by dividing all values by 100). We removed all-zero samples before running PERMANOVA tests.

We ran the MMUPHin adjust_batch() function with the following arguments:

library(MMUPHin)
library(magrittr)
library(tidyverse)
library(ggplot2)
library(vegan)

Transpose MetaPhlAn data: samples are rows, need taxa as rows, samples as columns

taxa_mat ← t(as.matrix(abundance_df2))

Confirm dimensions

dim(taxa_mat) # should be taxa x samples

Convert critical variables to factors

metadata_df2$DNA_Extraction_Method ← factor(metadata_df2$DNA_Extraction_Method)
metadata_df2$Phase ← factor(metadata_df2$Phase)

Ensure rownames of metadata match column names of abundance table

all(rownames(metadata_df2) %in% colnames(taxa_mat)) # should return TRUE
all(colnames(taxa_mat) %in% rownames(metadata_df2)) # should return TRUE

Reorder metadata to match abundance columns

metadata_df2 ← metadata_df2[match(colnames(taxa_mat), metadata_df2$Sample), ]

Set sample IDs as rownames (required by MMUPHin)

rownames(metadata_df2) ← metadata_df2$Sample

Convert MetaPhlAn relative abundances (in %) to proportions

taxa_mat ← taxa_mat / 100

Check to confirm values now are in the [0, 1] range

range(taxa_mat)

Should now be between 0 and 1

#To be absolutely sure your input is good:
summary(as.vector(taxa_mat))
all(taxa_mat >= 0 & taxa_mat <= 1) # Should return TRUE

adjusted ← adjust_batch(
feature_abd = taxa_mat,
batch = “DNA_Extraction_Method”,
covariates = “Phase”,
data = metadata_df2,
control = list(
zero_inflation = TRUE,
diagnostic_plot = “mmuphin_diagnostics.pdf”,
verbose = TRUE
)
)

Extract corrected matrix

adjusted_taxa ← adjusted$feature_abd_adj

Save corrected data

write.csv(t(taxa_mat), “MMUPHin_Input_Proportions.csv”)
write.csv(t(adjusted_taxa), “MMUPHin_Adjusted_Proportions.csv”)

If you want to go back to % for easier plots:

adjusted_perc ← adjusted_taxa * 100
write.csv(t(adjusted_perc), “MMUPHin_Adjusted_Percentages.csv”)

#Validation

Identify samples (columns) with only zeros in taxa_mat

nonzero_samples_before ← colSums(taxa_mat) > 0
nonzero_samples_after ← colSums(adjusted_taxa) > 0

Filter out zero-sum samples

filtered_taxa_mat ← taxa_mat[, nonzero_samples_before]
filtered_adjusted_taxa ← adjusted_taxa[, nonzero_samples_after]

Update metadata to include only non-zero samples (same order as matrix columns)

metadata_before ← metadata_df2[colnames(filtered_taxa_mat), ]
metadata_after ← metadata_df2[colnames(filtered_adjusted_taxa), ]

Bray-Curtis distances

bc_before ← vegdist(t(filtered_taxa_mat), method = “bray”)
bc_after ← vegdist(t(filtered_adjusted_taxa), method = “bray”)

PERMANOVA

adonis_before ← adonis2(bc_before ~ DNA_Extraction_Method, data = metadata_before)
adonis_after ← adonis2(bc_after ~ DNA_Extraction_Method, data = metadata_after)

Results

print(adonis_before)
print(adonis_after)

This ran successfully, and we obtained an adjusted abundance matrix (adjusted_taxa), which we used for downstream analysis.

PERMANOVA Results:
To evaluate batch effect before and after correction, we used Bray-Curtis distance and ran adonis2() from the vegan package. We observed:

Before correction:
R² = 0.02613 (~2.6%) for DNA_Extraction_Method, p = 0.001

After correction:
R² = 0.04206 (~4.2%) for DNA_Extraction_Method, p = 0.001

This result surprised us, as we expected the R² for DNA extraction to decrease after batch correction. I understand that MMUPHin protects covariates (like Phase), and since extraction and phase are confounded, some signal may remain. Still, I would like to confirm:
Does an increase in R² after correction indicate that we applied MMUPHin incorrectly, or is it an expected outcome given the design?

Final Questions:
Have we used MMUPHin correctly, based on our input format and arguments?

Is it normal in confounded designs for PERMANOVA R² of the batch variable to increase after correction?

Would you still recommend using the MMUPHin-adjusted data for alpha/beta diversity and ordination?

Any guidance or clarification you can provide would be incredibly helpful for properly interpreting and reporting our results. Thank you again for developing such a useful tool for the microbiome community.

Best regards,
Asim
PhD Researcher
University of Naples Federico II, Naples, Italy
asimur.rahman@unina.it