Building a MetaPhlAn database from fasta sequences

Hi @ecalfapietra
Currently, there is no script to generate a new MetaPhlAn 4 database from scratch.
But if you are interested in doing it, the procedure is as follows:

  1. Classify your genomes into species-level genome bins (SGBs) by clustering them at 95% genome identity
  2. For each SGB, annotate the FASTA sequences and define a set of core gene families (clustering the CDSs at 90% identity)
  3. Map all the core gene families against the initial set of genomes to define a set of unique, SGB-specific marker genes
    For a deeper explanation, you can have a look at the Methods section of the MetaPhlAn 4 paper: "Extending and improving metagenomic taxonomic profiling with uncharacterized species using MetaPhlAn 4" (Nature Biotechnology)
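
To make the logic of steps 1 and 3 a bit more concrete, here is a minimal Python sketch. It is not MetaPhlAn code and not necessarily how the authors implemented it. It assumes you have already computed pairwise genome distances and mapped the core gene families to the genomes with tools of your choice. The function names (`cluster_genomes`, `select_markers`), the input dictionaries, and the use of simple single-linkage clustering are all assumptions made for illustration; only the 95% identity cutoff (a distance of 0.05) comes from the procedure above.

```python
from collections import defaultdict


def cluster_genomes(genomes, distances, max_distance=0.05):
    """Group genomes into bins (hypothetical helper, single-linkage).

    `distances` maps (genome_a, genome_b) pairs to a distance in [0, 1];
    0.05 corresponds to the 95% genome identity cutoff of step 1.
    """
    parent = {g: g for g in genomes}

    def find(g):
        # Follow parent links to the representative, compressing the path.
        while parent[g] != g:
            parent[g] = parent[parent[g]]
            g = parent[g]
        return g

    for (a, b), dist in distances.items():
        if dist <= max_distance:
            parent[find(a)] = find(b)  # merge the two bins

    bins = defaultdict(list)
    for g in genomes:
        bins[find(g)].append(g)
    return sorted(sorted(members) for members in bins.values())


def select_markers(family_sgb, family_hits, genome_sgb):
    """Keep core gene families that hit only genomes of their own SGB (step 3).

    family_sgb:  core gene family -> SGB it was defined in
    family_hits: core gene family -> genomes it mapped to
    genome_sgb:  genome -> SGB it belongs to
    """
    markers = defaultdict(list)
    for family, sgb in family_sgb.items():
        hit_sgbs = {genome_sgb[g] for g in family_hits.get(family, ())}
        if hit_sgbs == {sgb}:  # at least one hit, and none outside the SGB
            markers[sgb].append(family)
    return {sgb: sorted(families) for sgb, families in markers.items()}


# Toy data, made up for illustration only.
genomes = ["g1", "g2", "g3"]
distances = {("g1", "g2"): 0.02, ("g1", "g3"): 0.20, ("g2", "g3"): 0.21}
print(cluster_genomes(genomes, distances))
```

With this toy input, `g1` and `g2` end up in one bin and `g3` in another. A real database would additionally need step 2 (annotation and core gene family definition) and a much more careful uniqueness check than the all-or-nothing rule used in `select_markers`.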