Significant difference in abundance estimates between database versions 202103 and 202307

Hi community,

I ran MetaPhlAn4 (version 4.0.6, bowtie version 2.5.2) with two standard database versions, CHOCOPhlAnSGB_202103 and CHOCOPhlAnSGB_202307, on a mock sample containing only two known bacteria, Escherichia coli and Limosilactobacillus reuteri, so that the two runs could be compared directly. The relative abundance of each species differs by roughly 2% between the two runs. The estimated_reads_mapped_to_known_clades values also differ substantially: 33363735 with the newest database, CHOCOPhlAnSGB_202307, versus 39835344 with CHOCOPhlAnSGB_202103. I suspect this is the main reason for the difference in abundance estimates.
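To put the gap in read counts into perspective, here is a quick calculation using only the two estimated_reads_mapped_to_known_clades values quoted above. The variable names are just for illustration.

```python
# estimated_reads_mapped_to_known_clades for the same mock sample,
# as reported with each database version.
reads_202103 = 39_835_344
reads_202307 = 33_363_735

# Absolute and relative drop when moving from the old to the new database.
diff = reads_202103 - reads_202307
fraction_lost = diff / reads_202103

print(f"difference: {diff} reads")
print(f"relative drop: {fraction_lost:.1%}")
```

So the newer database assigns roughly one sixth fewer reads to known clades for this sample.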

I am doing this comparison because I want to estimate E. coli abundance in my cohort. Strangely, with the latest database version no E. coli was detected in any sample, whereas with the older version low-abundance E. coli was detected in the same samples. Which result is likely to be closer to the truth, and do you have any suggestions? Also, why is the estimated_reads_mapped_to_known_clades value lower with the latest version than with the old one?

Kindly refer to the attached results for a more detailed overview.
Thanks,
Nemo
HEM-25.metaphlan.202103.txt (2.3 KB)
HEM-25.metaphlan.202307.txt (2.3 KB)

Could you please let me know if there has been any progress?

Dear @axolotl233,
From version to version, the set of marker genes for each species has changed, and we expect each release to improve the accuracy and precision of the markers. That might explain the difference in the estimated number of reads mapped.
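To see which species drive the difference, the two profiles can be compared side by side. Below is a minimal sketch, not an official MetaPhlAn utility. It assumes the default tab-separated profile layout (comment lines starting with `#`, clade name in the first column, relative abundance in the third column); if your files were produced with a different `-t` analysis type, the column index may need adjusting. The function names `parse_species` and `compare` are hypothetical, and the toy profiles use made-up, abbreviated lineages and numbers, not the values from the attached files.

```python
def parse_species(profile_text):
    """Return {species_name: relative_abundance} from a MetaPhlAn-style profile.

    Assumption: tab-separated rows, clade name in column 0,
    relative abundance in column 2, header/comment lines start with '#'.
    """
    species = {}
    for line in profile_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.split("\t")
        # Keep only rows whose deepest level is a species (s__),
        # skipping higher ranks and SGB-level (t__) rows.
        clade = fields[0].split("|")[-1]
        if clade.startswith("s__"):
            species[clade[3:]] = float(fields[2])
    return species


def compare(old, new):
    """List (species, old_abundance, new_abundance, new - old) for all species."""
    rows = []
    for name in sorted(set(old) | set(new)):
        a = old.get(name, 0.0)  # a species missing from a profile counts as 0
        b = new.get(name, 0.0)
        rows.append((name, a, b, b - a))
    return rows


# Toy profiles with made-up numbers and abbreviated lineages (illustration only).
OLD = (
    "#clade_name\tNCBI_tax_id\trelative_abundance\tadditional_species\n"
    "k__Bacteria\t2\t100.0\t\n"
    "k__Bacteria|s__Escherichia_coli\t2|562\t60.0\t\n"
    "k__Bacteria|s__Limosilactobacillus_reuteri\t2|1598\t40.0\t\n"
)
NEW = (
    "#clade_name\tNCBI_tax_id\trelative_abundance\tadditional_species\n"
    "k__Bacteria\t2\t100.0\t\n"
    "k__Bacteria|s__Escherichia_coli\t2|562\t58.0\t\n"
    "k__Bacteria|s__Limosilactobacillus_reuteri\t2|1598\t42.0\t\n"
)

rows = compare(parse_species(OLD), parse_species(NEW))
for name, a, b, delta in rows:
    print(f"{name}\t{a:.2f}\t{b:.2f}\t{delta:+.2f}")
```

For real data, replace the toy strings with the file contents, for example `parse_species(open("HEM-25.metaphlan.202103.txt").read())`. Species that appear in only one profile show up with 0.0 on the other side, which makes cases like the missing E. coli easy to spot across a cohort.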