I was wondering how I can add a new genome to the MetaPhlAn database. I am following these steps from the tutorial, but it is not clear how to generate the marker sequences from the query genome that need to be stored in a file called new_marker.fasta.
Customizing the database
In order to add a marker to the database, the user needs to perform the following steps:
Reconstruct the marker sequences (in fasta format) from the MetaPhlAn2 bowtie2 database by:
bowtie2-inspect metaphlan2/databases/mpa_v20_m200 > metaphlan2/markers.fasta
Add the marker sequence stored in a file new_marker.fasta to the marker set:
cat new_marker.fasta >> metaphlan2/markers.fasta
Rebuild the bowtie2 database:
Thanks for your support,
To add a new genome to the MetaPhlAn database, you need to annotate the reference genome and identify its marker genes. These are usually genes that are both core to the species and unique to it (no other species included in the database should share the same gene).
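To make the "core and unique" criterion concrete, here is a minimal sketch on toy data. It assumes you already have a tab-separated table of species, genome and gene family produced by your own annotation and clustering; the table layout and the names `target`, `other`, `g1` and `famA` are made up for illustration and are not MetaPhlAn conventions.

```shell
set -eu

# Toy presence table (hypothetical format): species <TAB> genome <TAB> gene family
table=$(mktemp)
printf 'target\tg1\tfamA\n'  > "$table"
printf 'target\tg1\tfamB\n' >> "$table"
printf 'target\tg1\tfamC\n' >> "$table"
printf 'target\tg2\tfamA\n' >> "$table"
printf 'target\tg2\tfamB\n' >> "$table"
printf 'other\tg3\tfamB\n'  >> "$table"
printf 'other\tg3\tfamD\n'  >> "$table"

# Keep families that are core (present in every genome of the target species)
# and unique (never seen in any other species).
markers=$(awk -v sp="target" '
    $1 == sp {
        if (!seen[$2 SUBSEP $3]++) count[$3]++   # count each family once per genome
        if (!genome[$2]++) n++                   # number of target genomes
        next
    }
    { foreign[$3] = 1 }                          # family occurs outside the target species
    END {
        for (f in count)
            if (count[f] == n && !(f in foreign)) print f
    }
' "$table" | sort)

echo "$markers"
```

In this toy table, famC is dropped because it is missing from one target genome, and famB is dropped because another species also carries it. The sequences of the surviving families would be the candidates for new_marker.fasta.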
I’ll refer you to this issue on the GitHub repository for more details https://github.com/biobakery/MetaPhlAn/issues/103
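Once you have candidate markers, the append step from the tutorial can be made a bit safer by checking that no sequence ID is already in use. The sketch below is only an illustration: `append_markers` is a hypothetical helper, not part of MetaPhlAn or bowtie2, and the FASTA files are toy stand-ins created in a temporary directory.

```shell
set -eu

# Hypothetical helper: append the markers in $2 to the marker FASTA in $1,
# but refuse to do so if any sequence ID would end up duplicated.
append_markers() {
    # IDs are the header text up to the first space
    dup_ids=$(cat "$1" "$2" | grep '^>' | cut -d' ' -f1 | sort | uniq -d)
    if [ -n "$dup_ids" ]; then
        echo "duplicate marker IDs, nothing appended:" >&2
        echo "$dup_ids" >&2
        return 1
    fi
    cat "$2" >> "$1"
}

# Toy files standing in for the extracted markers.fasta and your new_marker.fasta
workdir=$(mktemp -d)
printf '>marker_A\nACGTACGT\n>marker_B\nGGCCGGCC\n' > "$workdir/markers.fasta"
printf '>new_marker_1\nTTAATTAA\n' > "$workdir/new_marker.fasta"

append_markers "$workdir/markers.fasta" "$workdir/new_marker.fasta"
n_markers=$(grep -c '^>' "$workdir/markers.fasta")
echo "markers after append: $n_markers"
```

With real data you would point the helper at the markers.fasta produced by bowtie2-inspect and then rebuild the index from the combined file with bowtie2-build, which takes the reference FASTA followed by an index base name of your choice.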
Thanks for the explanation. Another question: it seems that version 3.0 has ~110 eukaryotic reference genomes, but none of them belongs to the fungal family Pichiaceae. Are there selection criteria that you use to choose the reference genomes for building the database? I am asking because this could bias the results of the taxonomic annotation.
The genomes included are those that have an annotated reference genome in the UniProt Proteomes portal. As of today I see that 10 genomes are available for this family, but none were present when the database was created.