I know this is really nitpicking, but I had to ask.
According to the paper you published,
Multi-omics of the gut microbial ecosystem in inflammatory bowel diseases,
the data analysis was done with the code located here:
hmp2_analysis bitbucket
(The code is said to have moved to GitHub; however, I wasn’t able to find it there to open an issue, so I am posting here.)
Anyway, the paper states that you obtained your results using bioBakery workflows v2, and I am trying to replicate them using bioBakery workflows v3 and compare the two. Specifically, I am re-running the MetaPhlAn part of the bioBakery pipeline.
You first load MetaPhlAn’s results and their metadata in ./hmp2_analysis/common/load_bugs.r into a combined object called bugs.pcl (bugs.pcl$meta holds the metadata file, while bugs.pcl$x holds the data file). Inside load_bugs.r you call a function named pcl.read (its definition can be found in the /common/pcl_utils.r code).
I rewrote the reading function to make loading the data and its metadata easier:
pcl.read.mine <- function(data.path, metadata.path, file.type, rowfeatures = T) {
    # data.path - (chr) path to a result file from bioBakery.
    # metadata.path - (chr) path to the HMP2 (human microbiome project 2) metadata.
    # file.type - (chr) which type of bioBakery result file is given as input: "taxonomy"/"humann pathways"/etc.
    # rowfeatures - (bool) whether the features are in the rows of the data.
    # Returns a list with:
    #   - meta = metadata (data.frame)
    #   - x = data (matrix) [samples x features]
    #   - ns = number of samples (numeric)
    #   - nf = number of features (numeric)
    library(stringr)
    data <- read.delim(data.path, row.names = 1, check.names = T, stringsAsFactors = F)
    # The HMP2 metadata file is comma-separated, so read it with read.csv.
    metadata <- read.csv(metadata.path, row.names = NULL, check.names = T, stringsAsFactors = F)
    # Assign row names to the metadata while accounting for the fact that the
    # first column (the sample names, our row names) contains duplicates,
    # which R does not allow as row names.
    sampnames <- metadata[, 1]
    dups <- duplicated(sampnames)
    if (any(dups)) {
        # De-duplicate repeated sample names by appending a running counter
        # (base R's make.unique would be an alternative).
        sampnames[dups] <- sprintf("%s_#%d", sampnames[dups], cumsum(dups)[dups])
    }
    # Assign the (now unique) sample names as row names.
    rownames(metadata) <- sampnames
    if (rowfeatures) {
        # Transpose if the features are in the rows, so that samples end up in
        # the rows (this also converts the data.frame into a matrix).
        data <- t(data)
    }
    # Make sure the data is converted into a numeric matrix.
    data <- data.matrix(data)
    if (file.type == "taxonomy") {
        rownames(data) <- str_remove(rownames(data), "_taxonomic_profile")
    } else {
        # Extend this for other file types in the future.
        stop("Incorrect file type given")
    }
    return(list(x = data, meta = metadata, ns = nrow(data), nf = ncol(data)))
}
I am posting this so that other people will be able to load the files more easily, as I had trouble doing so myself.
Long story short: the original code expects a single combined text file that starts with the metadata rows, followed by the data rows, and is saved with a .pcl extension. R has trouble reading such a file, so instead I simply read the metadata and data files separately and build the desired object for the analysis.
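For anyone who does want to parse the combined file directly, here is a minimal sketch of what that could look like, assuming you know how many leading rows are metadata (pcl.read.sketch and its n.metadata parameter are hypothetical; the real pcl.read in pcl_utils.r may determine this differently):
pcl.read.sketch <- function(pcl.path, n.metadata) {
    # Samples are in the columns; the first n.metadata rows are per-sample
    # metadata fields and the remaining rows are features.
    raw <- read.delim(pcl.path, row.names = 1, check.names = F, stringsAsFactors = F)
    meta <- as.data.frame(t(raw[seq_len(n.metadata), , drop = F]), stringsAsFactors = F)
    feat <- raw[-seq_len(n.metadata), , drop = F]
    # The metadata rows force every column to character, so convert the
    # feature values back to numeric before transposing to samples x features.
    x <- t(vapply(feat, as.numeric, numeric(nrow(feat))))
    colnames(x) <- rownames(feat)
    list(x = x, meta = meta, ns = nrow(x), nf = ncol(x))
}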
That said, this is not my main point. My main point is that the function pcl.filter.s from /common/pcl_utils.r, which is supposed to filter the samples according to a given criterion, does not consider the order of the samples in the metadata and data parts of the combined bugs.pcl object.
For example, in load_bugs.r you keep only the samples that had at least 1 million filtered reads:
bugs.pcl <- bugs.pcl %>% fix_metadata %>%
    pcl.filter.s(reads_filtered >= 1e6)
This is all well and good; however, delving into the function pcl.filter.s, the rows are filtered as follows:
dat$x <- dat$x[keep,,drop=F]
dat$meta <- dat$meta[keep,,drop=F]
where dat is bugs.pcl, x is the data matrix, meta is the metadata data.frame, and keep is a logical vector indicating which rows to keep. This does not work if the order of the samples differs between the metadata (dat$meta) and the data matrix (dat$x).
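To make the issue concrete, here is a toy example (the sample names and read counts are made up) showing how a logical vector computed from the metadata keeps the wrong rows of the data matrix when the two are ordered differently:
# Two samples whose metadata rows are in the opposite order from the data rows.
x <- matrix(c(5e5, 2e6), nrow = 2, dimnames = list(c("S1", "S2"), "bug1"))
meta <- data.frame(reads_filtered = c(3e6, 5e5), row.names = c("S2", "S1"))
# The filter is evaluated on the metadata...
keep <- meta$reads_filtered >= 1e6   # TRUE for S2, FALSE for S1
# ...but applied positionally to both objects, so row 1 of each is kept:
x[keep, , drop = F]     # keeps S1's abundances - the sample with only 5e5 reads
meta[keep, , drop = F]  # keeps S2's metadata
The filtered object now pairs S1's abundances with S2's metadata, and the low-depth sample slips through the read-count filter.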
I tried to find whether you re-order the metadata and data anywhere else, but that does not seem to be the case. So unless you created or loaded metadata and data files that were already in matching order, this filtering method might be wrong.
I fixed this in load_bugs.r as follows:
# Read in the raw table and the metadata
bugs.pcl <- pcl.read.mine(data.path = file.path(HMP2_data, "mgx", "taxonomic_profiles.tsv"),
                          metadata.path = file.path(HMP2_data, "metadata", "hmp2_metadata_2018-08-20.csv"),
                          file.type = "taxonomy")
# I added this line as well; otherwise the code stops on NA values in the
# reads_filtered column of the metadata (present for non-metagenomics rows).
bugs.pcl$meta <- bugs.pcl$meta %>% filter(data_type == "metagenomics")
stopifnot(all(!is.na(bugs.pcl$meta$reads_filtered)))
# Re-order the metadata rows to match the order of the data rows.
sample_idx <- match(rownames(bugs.pcl$x), bugs.pcl$meta$External.ID)
bugs.pcl$meta <- bugs.pcl$meta[sample_idx, ]
# Fix the metadata and filter out low-read-depth samples,
# keeping only samples with at least 1M filtered reads.
bugs.pcl <- bugs.pcl %>% fix_metadata %>%
    pcl.filter.s(reads_filtered >= 1e6)
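For extra safety, one could add a check like the following right after the match-based re-ordering (External.ID is, as above, assumed to be the metadata column matching the data's row names):
# Every data row should now line up with its metadata row; this also fails
# if match() produced NAs for samples that are missing from the metadata.
stopifnot(identical(rownames(bugs.pcl$x), bugs.pcl$meta$External.ID))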
Now, I know you have already published your paper, and I presume your metadata and data files were ordered consistently, but I just wanted others to pay attention to this.