Building a MetaPhlAn database from fasta sequences

Hello!
I would like to compare the Kraken2, SRA STAT and MetaPhlAn 4.0 tools, and to do this I would like to use the same database. I built my Kraken2 and SRA STAT databases from fasta files (from RefSeq, obtained after the Kraken2 database was created).
I have read the docs on GitHub and discussions such as "Customizing Chochophlan panproteome and Metaphlan marker gene databases with new taxa", "How to create marker sequences from a genome to add to metaphlan database?" and "[BUG] database installation error · Issue #103 · biobakery/MetaPhlAn · GitHub".
Is there a pipeline or script to create your own MetaPhlAn 4.0 database from fasta sequences? Or is there something like it?
Thanks in advance!

Hi @ecalfapietra
Currently, there is no script to generate a new MetaPhlAn 4 database from scratch.
But if you are interested in doing it, the procedure goes as follows:

  1. Classify your genomes into species-level genome bins (SGBs) by clustering them at 95% genome identity
  2. For each SGB, annotate the FASTA sequences and define a set of core gene families (by clustering the CDS at 90% identity)
  3. Map all the core gene families against the initial set of genomes to define a set of unique, SGB-specific marker genes

For a deeper explanation, you can have a look at the Methods section of the MetaPhlAn 4 paper: Extending and improving metagenomic taxonomic profiling with uncharacterized species using MetaPhlAn 4 | Nature Biotechnology
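
To make the logic of steps 1 and 3 concrete, here is a minimal Python sketch. It is not the MetaPhlAn 4 pipeline and no such script ships with MetaPhlAn. It assumes you have already computed pairwise genome identities (in percent) with a tool of your choice, and that you already have a table of which genomes each gene family maps to. The function names `cluster_sgbs` and `select_markers`, the single-linkage grouping, and the `min_core_fraction` parameter are all illustrative assumptions. Annotation and CDS clustering at 90% identity (step 2) are not covered.

```python
from collections import defaultdict


def cluster_sgbs(genomes, identity, threshold=95.0):
    """Group genomes into bins (hypothetical helper, not part of MetaPhlAn).

    `identity` maps (genome_a, genome_b) pairs to percent identity.
    Assumption: simple single-linkage grouping, i.e. two genomes end up in
    the same bin if a chain of pairs with identity >= threshold links them.
    """
    parent = {g: g for g in genomes}

    def find(g):
        # Union-find lookup with path halving.
        while parent[g] != g:
            parent[g] = parent[parent[g]]
            g = parent[g]
        return g

    for (a, b), value in identity.items():
        if value >= threshold:
            parent[find(a)] = find(b)

    bins = defaultdict(list)
    for g in genomes:
        bins[find(g)].append(g)
    # Sort for a deterministic result.
    return sorted(sorted(members) for members in bins.values())


def select_markers(sgbs, presence, min_core_fraction=1.0):
    """Pick candidate marker families per bin (hypothetical helper).

    `presence` maps a gene family id to the set of genomes it maps to.
    A family is kept for a bin if it is core there (found in at least
    `min_core_fraction` of the bin's genomes) and maps to no genome
    outside that bin.
    """
    markers = []
    for members in sgbs:
        own = set(members)
        chosen = sorted(
            family
            for family, carriers in presence.items()
            if len(carriers & own) / len(own) >= min_core_fraction
            and carriers <= own
        )
        markers.append(chosen)
    return markers


if __name__ == "__main__":
    # Toy data, invented for illustration only.
    genomes = ["g1", "g2", "g3", "g4"]
    identity = {
        ("g1", "g2"): 98.2, ("g3", "g4"): 96.0,
        ("g1", "g3"): 80.1, ("g2", "g3"): 79.5,
        ("g1", "g4"): 80.0, ("g2", "g4"): 79.0,
    }
    presence = {
        "famA": {"g1", "g2"},        # core and unique to the first bin
        "famB": {"g1", "g2", "g3"},  # shared across bins, so rejected
        "famC": {"g3", "g4"},        # core and unique to the second bin
        "famD": {"g3"},              # not core in its bin, so rejected
    }
    sgbs = cluster_sgbs(genomes, identity)
    print(sgbs)
    print(select_markers(sgbs, presence))
```

On real data you would probably want a more robust clustering scheme than single linkage and a core threshold below 100% to tolerate incomplete assemblies. The Methods section of the paper describes the choices the authors actually made.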