Dear Siyuan Ma,
I hope this message finds you well. I am a PhD researcher working with shotgun metagenomic sequencing data from a microbiome study, and I have been using your MMUPHin package to adjust for batch effects. I truly appreciate the thought and effort you have put into building a tool that’s tailored to the unique challenges of microbiome data.
I wanted to reach out to describe my dataset and analysis pipeline in detail, and ask for your advice on whether I am using MMUPHin appropriately, especially given the specific structure and confounding in my data.
Background of the Study:
We conducted a longitudinal study to investigate microbiome differences across seasons in a meat processing environment. We collected a total of 684 samples, with 337 from the summer phase and 347 from the winter phase.
However, due to the design and logistics of the study, we ended up using two different DNA extraction methods:
TCM method was used for most of the summer samples (286 samples).
Molysis Complete5 method was used for all of the winter samples (347 samples) and a small subset of the summer samples (51 samples).
As a result, season (our main biological variable of interest) and DNA extraction method (a technical batch effect) are strongly confounded. This has made batch correction especially challenging, since we want to preserve the biological signal of seasonal variation while removing any technical noise from the extraction method.
Input Data:
We analyzed shotgun metagenomic data using MetaPhlAn output to obtain relative abundances of microbial taxa. The resulting data matrix (abundance_df2) has:
Samples as rows, and 3930 taxa (species level) as columns. We also created a metadata dataframe (metadata_df2) with:
Phase as the variable indicating season (Summer/Winter), and DNA_Extraction_Method as the batch variable.
MMUPHin Batch Correction:
We transposed the abundance matrix so that taxa were in rows and samples in columns, and then converted relative abundances from percentages to proportions (by dividing all values by 100). We removed all-zero samples before running PERMANOVA tests.
We ran the MMUPHin adjust_batch() function with the following arguments:
library(MMUPHin)
library(magrittr)
library(tidyverse)
library(ggplot2)
library(vegan)
Transpose MetaPhlAn data: samples are rows, need taxa as rows, samples as columns
taxa_mat ← t(as.matrix(abundance_df2))
Confirm dimensions
dim(taxa_mat) # should be taxa x samples
Convert critical variables to factors
metadata_df2$DNA_Extraction_Method ← factor(metadata_df2$DNA_Extraction_Method)
metadata_df2$Phase ← factor(metadata_df2$Phase)
Ensure rownames of metadata match column names of abundance table
all(rownames(metadata_df2) %in% colnames(taxa_mat)) # should return TRUE
all(colnames(taxa_mat) %in% rownames(metadata_df2)) # should return TRUE
Reorder metadata to match abundance columns
metadata_df2 ← metadata_df2[match(colnames(taxa_mat), metadata_df2$Sample), ]
Set sample IDs as rownames (required by MMUPHin)
rownames(metadata_df2) ← metadata_df2$Sample
Convert MetaPhlAn relative abundances (in %) to proportions
taxa_mat ← taxa_mat / 100
Check to confirm values now are in the [0, 1] range
range(taxa_mat)
Should now be between 0 and 1
#To be absolutely sure your input is good:
summary(as.vector(taxa_mat))
all(taxa_mat >= 0 & taxa_mat <= 1) # Should return TRUE
adjusted ← adjust_batch(
feature_abd = taxa_mat,
batch = “DNA_Extraction_Method”,
covariates = “Phase”,
data = metadata_df2,
control = list(
zero_inflation = TRUE,
diagnostic_plot = “mmuphin_diagnostics.pdf”,
verbose = TRUE
)
)
Extract corrected matrix
adjusted_taxa ← adjusted$feature_abd_adj
Save corrected data
write.csv(t(taxa_mat), “MMUPHin_Input_Proportions.csv”)
write.csv(t(adjusted_taxa), “MMUPHin_Adjusted_Proportions.csv”)
If you want to go back to % for easier plots:
adjusted_perc ← adjusted_taxa * 100
write.csv(t(adjusted_perc), “MMUPHin_Adjusted_Percentages.csv”)
#Validation
Identify samples (columns) with only zeros in taxa_mat
nonzero_samples_before ← colSums(taxa_mat) > 0
nonzero_samples_after ← colSums(adjusted_taxa) > 0
Filter out zero-sum samples
filtered_taxa_mat ← taxa_mat[, nonzero_samples_before]
filtered_adjusted_taxa ← adjusted_taxa[, nonzero_samples_after]
Update metadata to include only non-zero samples (same order as matrix columns)
metadata_before ← metadata_df2[colnames(filtered_taxa_mat), ]
metadata_after ← metadata_df2[colnames(filtered_adjusted_taxa), ]
Bray-Curtis distances
bc_before ← vegdist(t(filtered_taxa_mat), method = “bray”)
bc_after ← vegdist(t(filtered_adjusted_taxa), method = “bray”)
PERMANOVA
adonis_before ← adonis2(bc_before ~ DNA_Extraction_Method, data = metadata_before)
adonis_after ← adonis2(bc_after ~ DNA_Extraction_Method, data = metadata_after)
Results
print(adonis_before)
print(adonis_after)
This ran successfully, and we obtained an adjusted abundance matrix (adjusted_taxa), which we used for downstream analysis.
PERMANOVA Results:
To evaluate batch effect before and after correction, we used Bray-Curtis distance and ran adonis2() from the vegan package. We observed:
Before correction:
R² = 0.02613 (~2.6%) for DNA_Extraction_Method, p = 0.001
After correction:
R² = 0.04206 (~4.2%) for DNA_Extraction_Method, p = 0.001
This result surprised us, as we expected the R² for DNA extraction to decrease after batch correction. I understand that MMUPHin protects covariates (like Phase), and since extraction and phase are confounded, some signal may remain. Still, I would like to confirm:
Does an increase in R² after correction indicate that we applied MMUPHin incorrectly, or is it an expected outcome given the design?
Final Questions:
Have we used MMUPHin correctly, based on our input format and arguments?
Is it normal in confounded designs for PERMANOVA R² of the batch variable to increase after correction?
Would you still recommend using the MMUPHin-adjusted data for alpha/beta diversity and ordination?
Any guidance or clarification you can provide would be incredibly helpful for properly interpreting and reporting our results. Thank you again for developing such a useful tool for the microbiome community.
Best regards,
Asim
PhD Researcher
University of Naples Federico II, Naples, Italy
asimur.rahman@unina.it