What is the logic for determining a marker gene?

I’m trying to understand how MetaPhlAn databases are created. My understanding is that a marker gene is a core gene of a pangenome that is present only in that pangenome.

My questions:

  • Do you cluster in protein space or nucleotide space?
  • What percent identity (and query coverage) do you use as the minimum?
  • What is the minimum value of “coreness” for a gene to be considered (e.g., 100%)?
  • How important is it to consider multi-copy genes?

I found the following sources:

Hi @jolespin
Answering your question:

  • We cluster in protein space to define protein families (similar to what is done for UniRef). For markers we use 90% identity and 80% coverage with the cluster centroid (again, as for UniRef).
  • For coreness, marker genes can range from 50% to 100% coreness, depending on the species.
  • It is important, as multi-copy genes would bias the relative abundance estimation. However, the stat_q parameter gets rid of markers found at an unexpectedly high abundance compared with the other markers of the species present in the sample, so it mitigates that bias.
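
To make the coreness and uniqueness criteria concrete, here is a minimal Python sketch. It is not the actual MetaPhlAn database pipeline: the function name, the input layout, and the strict "seen in exactly one species" rule are assumptions for illustration, and it presumes protein families were already clustered (for example at 90% identity and 80% coverage). The 0.5 default only mirrors the lower end of the 50 to 100% range mentioned above.

```python
from collections import defaultdict


def select_markers(presence, genomes_per_species, min_coreness=0.5):
    """Pick candidate marker families per species (illustrative only).

    presence: dict mapping family id -> dict mapping species -> set of
        genome ids in which that family was found (hypothetical layout).
    genomes_per_species: dict mapping species -> total number of genomes.
    min_coreness: minimum fraction of a species' genomes carrying the family.
    """
    markers = defaultdict(list)
    for family, by_species in presence.items():
        # Uniqueness (assumed strict here): the family occurs in one species only.
        if len(by_species) != 1:
            continue
        (species, genomes), = by_species.items()
        # Coreness: fraction of that species' genomes that carry the family.
        coreness = len(genomes) / genomes_per_species[species]
        if coreness >= min_coreness:
            markers[species].append((family, coreness))
    return dict(markers)


if __name__ == "__main__":
    presence = {
        "famA": {"sp1": {"g1", "g2", "g3", "g4"}},  # core and unique
        "famB": {"sp1": {"g1"}, "sp2": {"h1"}},     # shared, not unique
        "famC": {"sp2": {"h1"}},                    # unique, coreness 0.5
        "famD": {"sp1": {"g1"}},                    # unique, coreness 0.25
    }
    print(select_markers(presence, {"sp1": 4, "sp2": 2}))
```

A per-species threshold could be passed in place of the single `min_coreness` value, since the reply says coreness depends on the species.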

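The effect of stat_q on multi-copy markers can be pictured as a quantile-trimmed average. The sketch below is an assumption about the general idea, not MetaPhlAn's code: `trimmed_mean` is a hypothetical helper, it trims both tails symmetrically, and it works on plain per-marker values rather than on read counts and marker lengths.

```python
def trimmed_mean(values, q=0.2):
    """Average after dropping the lowest and highest q fraction of values.

    values: per-marker abundance estimates for one species (illustrative).
    q: fraction trimmed from each tail; must satisfy 0 <= q < 0.5.
    """
    if not values:
        raise ValueError("values must not be empty")
    if not 0 <= q < 0.5:
        raise ValueError("q must be in [0, 0.5)")
    ordered = sorted(values)
    k = int(len(ordered) * q)  # number of markers dropped from each tail
    kept = ordered[k:len(ordered) - k] if k else ordered
    return sum(kept) / len(kept)


if __name__ == "__main__":
    # Four well-behaved markers plus one inflated by extra gene copies.
    per_marker = [10, 11, 9, 10, 100]
    print(sum(per_marker) / len(per_marker))  # plain mean, pulled up by the outlier
    print(trimmed_mean(per_marker, q=0.2))    # outlier falls in the trimmed tail
```

With very few markers `int(len(values) * q)` can be 0, in which case nothing is trimmed and an inflated marker still contributes to the average.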