Hi, thank you very much for this wonderful tool.
I would like to identify marker genes from clusters of genomes in order to create a customized MetaPhlAn database.
I have looked in Segata’s paper from 2012 and found the following key steps:
Identification of clade-specific core genes:
1. Identify non-redundant (NR) genes from each genome of the species
2. Cluster NR genes based on a 75% nucleotide sequence identity threshold
3. Realign clade-specific gene families against the raw genomes
4. Compute the posterior probability density function (using the beta distribution)
Screening of core genes for unique taxonomic marker genes:
5. Exclude core genes that are not uniquely present in a clade
6. Exclude multi-copy genes if possible (from step 1)
7. Define a “uniqueness index” for core genes
My questions are:
A) Is there some script that you can share to help me run those steps?
B) If not, can you explain steps 4 and 7?
Unfortunately, I don’t have a script for extracting markers from MAGs. We have an ad hoc pipeline for generating the MetaPhlAn database, but it is not straightforward to adapt it to MAGs because the data source is different. The pyphlan repo (https://github.com/SegataLab/pyphlan) contains some scripts called choco_, but I don’t have any guidance for running or using them.
In the current version of the pipeline, we don’t take into account step 4. For a species A, the “uniqueness index” is calculated as the number of species besides A in which the marker is also present. This can be easily calculated by mapping all the markers identified to the full set of reference genomes.
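To make the uniqueness index concrete, here is a minimal Python sketch. It is not the code from our pipeline. It assumes you have already mapped the markers against the reference genomes with an aligner of your choice and reduced the hits to (marker, species) pairs; the function name `uniqueness_index` and the input layout are hypothetical.

```python
from collections import defaultdict


def uniqueness_index(hits, marker_species):
    """Count, for each marker, the species other than its own in which it is found.

    hits: iterable of (marker_id, species) pairs, one per mapping hit
          (assumed input format, derived from your own mapping step).
    marker_species: dict mapping marker_id -> the species the marker belongs to.
    """
    species_hit = defaultdict(set)
    for marker, species in hits:
        species_hit[marker].add(species)

    # Remove the marker's own species, then count what is left.
    return {
        marker: len(species_hit[marker] - {own})
        for marker, own in marker_species.items()
    }


# Toy example with made-up marker and species names.
marker_species = {"m1": "A", "m2": "A", "m3": "B"}
hits = [
    ("m1", "A"), ("m1", "A"),               # only found in its own species
    ("m2", "A"), ("m2", "B"), ("m2", "C"),  # also found in B and C
    ("m3", "B"), ("m3", "A"),               # also found in A
]
print(uniqueness_index(hits, marker_species))
```

A value of 0 means the marker was found only in its own species; higher values mean it is shared with more species. Which identity and coverage cutoffs define a "hit" is up to your mapping step and is not covered here.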
Is there someone I could contact to learn how to use these scripts? The info at https://github.com/biobakery/MetaPhlAn/wiki/MetaPhlAn-3.0#customizing-the-database explains how to add markers, but not how to add them in a way that matches the default database (e.g., how to cluster).