Hello everyone,
I am analysing some stool samples from a non-westernized population and I would like to compare my profile with that reported in the CuratedMetagenomicData dataset.
I run metaphlan4 on my samples and other samples from NCBI, with default parameters, and I merged the profile of my samples in a unique table. These are the commands used:
metaphlan SAMPLE --input_type fastq --nproc 3 --bowtie2out Metaphlan4/Sample.bowtie2.bz2 --bt2_ps very-sensitive --read_min_len 30 > Sample_profile.txt
merge_metaphlan_tables.py Metaphlan4/*.txt > DatasetMERGED_Metaphlan4.txt
Than I started working on curated metagenomic data selecting stool samples from healthy adults in this way:
cs<-sampleMetadata |>
filter(disease==“healthy”)|> filter(body_site ==“stool”) |> filter(age_category==“adult”)|> filter(country==c(“ITA”,“GBR”,“IRL”,“DNK”,“TZA”,“CMR”,“ETH”,“IDN”,“PER”,“FJI”))|>
returnSamples(“relative_abundance”,rownames=“short”)
assayNames(cs) <-“counts”
altExps(cs) <-splitByRanks(cs)
cs.ps<-makePhyloseqFromTreeSummarizedExperiment(cs,abund_values=“relative_abundance”)
cs.species<-tax_glom(cs.ps,taxrank=“species”)
Than I introduced my dataset to R and phyloseq using this function (Import a table of MetaPhlAn taxonomic abundances into phyloseq · GitHub)
mphlanin ← read.csv(“DatasetMERGED_Metaphlan4.txt”, sep = “\t”, strip.white = T, stringsAsFactors = F, row.names = 1)
metadata ← read.delim(“MappingDataset.txt”, header=TRUE, sep = “\t”)
metadatadf ← data.frame(metadata)
row.names(metadatadf) ← metadatadf$X
sample ← sample_data(metadatadf)
ps= metaphlanToPhyloseq(mphlanin)
ps.species<-tax_glom(ps,taxrank=“species”)
pseq ← transform(ps.species, “compositional”)
ps.all<-merge_phyloseq(pseq,cs.ps)
But when I perform some beta-diversity analysis I get that all the samples I run with metaphlan4 create a separate clusters with respected to that of curatedMetagenomicData, which is not expected to me since I am analysing close populations to that present to the database and I got the idea that there is some inner bias in che merging part of the two dataset. Has anyone experienced the same issue and how can I solve that? Which is the right way to compare my data to that of curatedMetagenomicData ?
thanks to everyone