Offer Pre-built Databases Organised By Biome

I saw that the sequences under "Species-level genome bins (SGBs) from the human microbiome" are offered as “Sequences (fasta files; due to large size they are split into 5 parts)”. Could they be split by biome instead? For example, I am interested in analysing whole genome sequencing data from oral cavity cancer (the subset of short Illumina reads that do not map to the human reference genome), so the database I would like to use does not need to contain, for instance, marine bacteria. Ideally, I could just download a Human Oral database and work with that for this project. Would that be feasible to offer?

Also, are there any details public yet about how mpa_vJan21_CHOCOPhlAnSGB_202103 was constructed? It would be great if users could reproduce creating CHOCOPhlAn from scratch.

Hi @Dario
Unfortunately, due to the size of the data, it would be infeasible for us to re-upload it split by biome. However, you should be able to select the MAGs assembled from oral samples from the full dataset using Supplementary Table 1 of the original publication: Extensive Unexplored Human Microbiome Diversity Revealed by Over 150,000 Genomes from Metagenomes Spanning Age, Geography, and Lifestyle (Cell).
The MAG identifiers should follow the structure: StudyID__SampleID__binID
You can find the details of how the MetaPhlAn 4 database vJan21 was built in the following preprint: https://www.biorxiv.org/content/10.1101/2022.08.22.504593v1
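As a rough illustration of the filtering suggested above, here is a minimal Python sketch. It assumes the supplementary table has been exported to a tab-separated file; the column names (`study`, `sample_id`, `body_site`) and the `oralcavity` label are hypothetical placeholders and need to be replaced with whatever the actual table uses. Only the identifier structure StudyID__SampleID__binID comes from this thread. The toy IDs at the bottom are made up for demonstration.

```python
import csv
import io


def parse_mag_id(mag_id):
    """Split 'StudyID__SampleID__binID' into (study, sample, bin).

    The first and last '__' are used as separators, so a sample ID that
    itself contains '__' is still handled.
    """
    if mag_id.count("__") < 2:
        raise ValueError(f"unexpected MAG identifier: {mag_id!r}")
    study, rest = mag_id.split("__", 1)
    sample, bin_id = rest.rsplit("__", 1)
    return study, sample, bin_id


def oral_sample_keys(table_handle, study_col="study", sample_col="sample_id",
                     site_col="body_site", oral_label="oralcavity"):
    """Collect (study, sample) pairs whose body site matches oral_label.

    All column names and the label are assumptions, not the real table schema.
    """
    reader = csv.DictReader(table_handle, delimiter="\t")
    return {
        (row[study_col], row[sample_col])
        for row in reader
        if row[site_col].strip().lower() == oral_label
    }


def select_oral_mags(mag_ids, oral_keys):
    """Keep only the MAG IDs whose (study, sample) pair is in oral_keys."""
    return [m for m in mag_ids if parse_mag_id(m)[:2] in oral_keys]


# Toy metadata and IDs, invented for illustration only.
toy_table = io.StringIO(
    "study\tsample_id\tbody_site\n"
    "StudyA\tS1\toralcavity\n"
    "StudyA\tS2\tstool\n"
    "StudyB\tS__3\tOralcavity\n"
)
toy_mags = ["StudyA__S1__bin.1", "StudyA__S2__bin.4", "StudyB__S__3__bin.2"]

oral_mags = select_oral_mags(toy_mags, oral_sample_keys(toy_table))
print(oral_mags)
```

The resulting list of IDs could then be used to pull the matching records out of the five FASTA parts; how the MAG identifiers appear in those files is not assumed here.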

Actually, that preprint contains zero instances of the word CHOCOPhlAn. Is it somewhere else?

Hi @Dario
While it is not called CHOCOPhlAn anymore, the Methods section of the preprint describes the whole procedure, from genomes to marker genes.