Hi @ecalfapietra
Currently, there is no script to generate a new MetaPhlAn 4 database from scratch.
But if you are interested in doing it, the procedure is as follows:
- Classify your genomes into species-level genome bins (SGBs) by clustering them at 95% genome identity
- For each SGB, annotate the FASTA sequences and define a set of core gene families (clustering the CDS at 90% identity)
- Map all the core gene families against the initial set of genomes to define, for each SGB, a set of unique, SGB-specific marker genes
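
To make the last step more concrete, here is a minimal sketch in Python of how the marker selection could look once the mapping is done. This is not MetaPhlAn code: the function `select_markers` and its inputs (`core_families`, `genome_to_sgb`, `hits`) are hypothetical names for illustration, and the rule used here (keep a core gene family only if it maps to no genome outside its own SGB) is a simplification. The actual pipeline described in the paper may apply additional thresholds.

```python
from collections import defaultdict


def select_markers(core_families, genome_to_sgb, hits):
    """Pick SGB-specific marker families from mapping results (toy version).

    core_families: dict family_id -> SGB in which the family is core
    genome_to_sgb: dict genome_id -> SGB the genome was clustered into
    hits:          iterable of (family_id, genome_id) pairs, one per
                   genome the family mapped to
    Returns a dict SGB -> sorted list of families that never map
    to a genome from a different SGB.
    """
    # Collect the set of SGBs each family maps to.
    sgbs_hit = defaultdict(set)
    for family_id, genome_id in hits:
        sgbs_hit[family_id].add(genome_to_sgb[genome_id])

    markers = defaultdict(list)
    for family_id, own_sgb in core_families.items():
        # Any hit outside the family's own SGB disqualifies it.
        off_target = sgbs_hit.get(family_id, set()) - {own_sgb}
        if not off_target:
            markers[own_sgb].append(family_id)

    return {sgb: sorted(families) for sgb, families in markers.items()}


# Hypothetical toy data: three genomes, two SGBs, three core families.
genome_to_sgb = {"g1": "SGB1", "g2": "SGB1", "g3": "SGB2"}
core_families = {"famA": "SGB1", "famB": "SGB1", "famC": "SGB2"}
hits = [
    ("famA", "g1"), ("famA", "g2"),  # famA maps only within SGB1
    ("famB", "g1"), ("famB", "g3"),  # famB also maps to an SGB2 genome
    ("famC", "g3"),                  # famC maps only within SGB2
]

markers = select_markers(core_families, genome_to_sgb, hits)
print(markers)
```

In this toy example `famB` is dropped because it also maps to a genome from SGB2, while `famA` and `famC` are kept as markers for their respective SGBs.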
For a deeper explanation, you can have a look at the Methods section of the MetaPhlAn 4 paper: Extending and improving metagenomic taxonomic profiling with uncharacterized species using MetaPhlAn 4 | Nature Biotechnology