Hi, thank you very much for this wonderful tool.
I would like to identify marker genes from clusters of genomes in order to create a customized MetaPhlAn database.
I looked at the 2012 paper by Segata et al. and found the following key steps:
Identification of clade-specific core genes:
1. Identify non-redundant genes from each genome of the species
2. Cluster NR genes based on 75% nucleotide sequence identity threshold
3. Realign clade-specific gene families against the raw genomes
4. Compute the posterior probability density function (using the beta distribution)
Screening of core genes for unique taxonomic marker genes:
5. Exclude core genes that are not uniquely present in a clade
6. Exclude multi-copy genes if possible (from step 1)
7. Define a “uniqueness index” for core genes
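To make my question about step 4 (the beta distribution step) concrete, here is how I currently read it. This is only my own guess, not code or parameters from the paper: I assume a uniform Beta(1, 1) prior on the fraction of genomes in a clade that carry a gene family, update it with the number of genomes where the family was found, and call the family "core" if the posterior probability that this fraction exceeds some threshold is high. The function name `prob_core` and the 0.9 threshold are hypothetical.

```python
from math import comb


def prob_core(present: int, total: int, threshold: float = 0.9) -> float:
    """Posterior probability that the true presence rate exceeds `threshold`.

    Assumes a uniform Beta(1, 1) prior, so after seeing the gene family in
    `present` of `total` genomes the posterior is
    Beta(present + 1, total - present + 1).
    For integer parameters, the upper tail of this beta distribution equals
    a binomial CDF: P(rate > t) = P(Binomial(total + 1, t) <= present).
    """
    if not 0 <= present <= total:
        raise ValueError("present must be between 0 and total")
    if not 0.0 <= threshold <= 1.0:
        raise ValueError("threshold must be between 0 and 1")
    n = total + 1
    return sum(
        comb(n, i) * threshold**i * (1.0 - threshold) ** (n - i)
        for i in range(present + 1)
    )


# Hypothetical example: a gene family found in 19 of 20 genomes of a clade
# versus one found in only 12 of 20.
high = prob_core(19, 20)
low = prob_core(12, 20)
print(f"19/20 genomes: {high:.3f}, 12/20 genomes: {low:.3f}")
```

If the paper uses a different prior, or accounts for incomplete draft genomes in some other way, I would be glad to be corrected.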
My questions are:
A) Is there some script that you can share to help me run those steps?
B) If not, can you explain steps 4 and 7?
Thanks,
Almog.