Compare Metaphlan4 output with CuratedMetagenomicData?

Hello everyone,
I am analysing some stool samples from a non-westernized population and I would like to compare my profile with that reported in the CuratedMetagenomicData dataset.
I run metaphlan4 on my samples and other samples from NCBI, with default parameters, and I merged the profile of my samples in a unique table. These are the commands used:

metaphlan SAMPLE --input_type fastq --nproc 3 --bowtie2out Metaphlan4/Sample.bowtie2.bz2 --bt2_ps very-sensitive --read_min_len 30 > Sample_profile.txt

merge_metaphlan_tables.py Metaphlan4/*.txt > DatasetMERGED_Metaphlan4.txt

Than I started working on curated metagenomic data selecting stool samples from healthy adults in this way:

cs<-sampleMetadata |>
filter(disease==“healthy”)|> filter(body_site ==“stool”) |> filter(age_category==“adult”)|> filter(country==c(“ITA”,“GBR”,“IRL”,“DNK”,“TZA”,“CMR”,“ETH”,“IDN”,“PER”,“FJI”))|>
returnSamples(“relative_abundance”,rownames=“short”)
assayNames(cs) <-“counts”
altExps(cs) <-splitByRanks(cs)
cs.ps<-makePhyloseqFromTreeSummarizedExperiment(cs,abund_values=“relative_abundance”)

cs.species<-tax_glom(cs.ps,taxrank=“species”)

Than I introduced my dataset to R and phyloseq using this function (Import a table of MetaPhlAn taxonomic abundances into phyloseq · GitHub)

mphlanin ← read.csv(“DatasetMERGED_Metaphlan4.txt”, sep = “\t”, strip.white = T, stringsAsFactors = F, row.names = 1)
metadata ← read.delim(“MappingDataset.txt”, header=TRUE, sep = “\t”)
metadatadf ← data.frame(metadata)
row.names(metadatadf) ← metadatadf$X
sample ← sample_data(metadatadf)
ps= metaphlanToPhyloseq(mphlanin)
ps.species<-tax_glom(ps,taxrank=“species”)
pseq ← transform(ps.species, “compositional”)
ps.all<-merge_phyloseq(pseq,cs.ps)

But when I perform some beta-diversity analysis I get that all the samples I run with metaphlan4 create a separate clusters with respected to that of curatedMetagenomicData, which is not expected to me since I am analysing close populations to that present to the database and I got the idea that there is some inner bias in che merging part of the two dataset. Has anyone experienced the same issue and how can I solve that? Which is the right way to compare my data to that of curatedMetagenomicData ?
thanks to everyone

I believe curatedMetagenomicData (cMD) currently contains profiles from MetaPhlAn 3 (but an update to MetaPhlAn 4 is in progress). Since MetaPhlAn 4 outputs profiles using a totally different (SGB-based) taxonomy, it makes sense that those profiles would spuriously look totally different from the cMD profiles. If you did the comparison at the genus level you would probably see something more sensible, although there would still be a big batch effect from combining v3 and v4 profiles.

Yes, indeed I have tried also at Genus level and it is more or less the same, the batch effect is still very visible. Ok so i have to wait for the update,
thank you a lot for your reply

Hi,

I would like to ask if the Metaphlan4 profiles of curatedMetagenomicData (cMD) are already available.

Thanks a lot in advance!
Asier