What is the logic for determining a marker gene?

jolespin · October 5, 2023, 7:23pm

I’m trying to understand how MetaPhlAn databases are created. I understand that a marker gene is a core gene in a pangenome that is present only in that pangenome.

My questions:

Do you cluster in protein space or nucleotide space?
What percent identity (and query coverage) do you use as the minimum?
What is the minimum value of “coreness” for a gene to be considered (e.g., 100%)?
How important is it to consider multi-copy genes?

I found the following sources:

aitor.blancomiguez · March 5, 2024, 9:15am

Hi @jolespin
Answering your question:

We cluster in protein space to define protein families (similar to what UniRef does). For markers, we use 90% identity and 80% coverage against the cluster centroid (again, as in UniRef).
For coreness, marker genes can range from 50% to 100% coreness, depending on the species.
Multi-copy genes matter because they would bias the relative abundance estimation. However, the stat_q parameter discards markers found at an unexpectedly high abundance compared with the other markers of the same species in the sample, which mitigates that bias.
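
To make the coreness and stat_q points above more concrete, here is a minimal Python sketch. It is not MetaPhlAn's actual code. The function names (`coreness`, `candidate_markers`, `truncated_mean`), the toy data layout (each genome as a set of protein family IDs), and the strict "absent from every other species" rule are all assumptions for illustration. The truncated mean only approximates the idea behind a quantile-based robust average; the real stat_q behaviour may differ in detail.

```python
from typing import Dict, List, Set


def coreness(family: str, genomes: List[Set[str]]) -> float:
    """Fraction of a species' genomes that carry the given protein family."""
    return sum(family in genome for genome in genomes) / len(genomes)


def candidate_markers(
    species: str,
    pangenomes: Dict[str, List[Set[str]]],
    min_coreness: float = 0.5,  # assumed threshold; the thread says 50-100% depending on species
) -> Set[str]:
    """Families that are core enough in `species` and absent from all other species.

    Strict uniqueness is a simplifying assumption of this sketch.
    """
    target = pangenomes[species]
    # Collect every family seen in any genome of any other species.
    seen_elsewhere: Set[str] = set()
    for name, genomes in pangenomes.items():
        if name != species:
            for genome in genomes:
                seen_elsewhere |= genome
    all_families: Set[str] = set().union(*target)
    return {
        family
        for family in all_families
        if family not in seen_elsewhere and coreness(family, target) >= min_coreness
    }


def truncated_mean(values: List[float], q: float = 0.2) -> float:
    """Mean after dropping the lowest and highest q fraction of values.

    Illustrates how a quantile-based robust average can ignore a marker
    with unexpectedly high coverage (for example, a multi-copy gene).
    """
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    k = int(len(ordered) * q)
    # Fall back to all values if trimming would leave nothing.
    kept = ordered[k:len(ordered) - k] or ordered
    return sum(kept) / len(kept)


if __name__ == "__main__":
    # Toy pangenomes: each genome is a set of hypothetical protein family IDs.
    pangenomes = {
        "sp_A": [{"f1", "f2", "f3"}, {"f1", "f2"}, {"f1", "f4"}, {"f1"}],
        "sp_B": [{"f2", "f5"}],
    }
    print(candidate_markers("sp_A", pangenomes, min_coreness=0.5))
    # One marker with inflated coverage is trimmed away before averaging.
    print(truncated_mean([1.0, 1.0, 1.0, 1.0, 100.0], q=0.2))
```

In this toy example, `f2` is excluded because it also appears in `sp_B`, and `f3` and `f4` are excluded because they occur in only one of four `sp_A` genomes. Lowering `min_coreness` admits less conserved families, which is the trade-off behind the 50 to 100% range mentioned above.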
