I saw the note “Sequences (fasta files; due to large size they are split into 5 parts)” under Species-level genome bins (SGBs) from the human microbiome. Could the sequences be split by biome instead? For example, I am interested in analysing oral cavity cancer whole genome sequencing data (the subset of short Illumina reads that do not map to the human reference genome), so the database I would like to use does not need to include, for instance, marine bacteria. Ideally, I could just download a Human Oral database and work with that for this project. Would it be feasible to offer that?
Also, are any details public yet about how mpa_vJan21_CHOCOPhlAnSGB_202103 was constructed? It would be great if users could rebuild CHOCOPhlAn from scratch.