Generation of MetaPhlAn 3 markers

Dear authors,

I was reading your 2021 eLife paper and was hoping you could help me understand the section on MetaPhlAn 3 marker generation.

On page 20, in paragraph 3 of the section “Generation of MetaPhlAn 3 markers”, the first step after length filtering seems to be:

“We classify candidate markers into unique and quasi-markers according to the ‘uniqueness’ value: markers having zero ‘uniqueness’ are reported as ‘unique markers’. When no unique markers can be identified, the less-stringent thresholds used in the marker discovery procedure allows the identification of the so-called ‘quasi-markers’, markers having non-null values of ‘uniqueness’.”

Then in the next paragraph (paragraph 4, still on page 20), you introduce the four tiers A, B, C, and U, beginning with this sentence:

"The iterative approach started with the definition of four tiers of UNIQUE MARKERS according to a combination of the values of ‘coreness’, ‘uniqueness’, and ‘external_genomes’. "

In this sentence, the phrase “unique markers” (capitalized above) is not in quotation marks. My question is: after classifying the candidate markers into unique and quasi-markers, did you further classify only the unique markers into four tiers? If so, what happens to the quasi-markers?

If not, are the two paragraphs describing the same procedure, with paragraph 4 explaining how the ‘unique’ and ‘quasi-markers’ of paragraph 3 come about? To me, they read as two separate classification systems. The one in paragraph 3 considers only the ‘uniqueness’ score, while the four tiers in paragraph 4 also consider the ‘coreness’ score. This would mean that all tier ‘U’ markers are ‘unique markers’, and that some markers from the other tiers can also be ‘unique markers’, as long as their ‘Uniqueness_NR90’ and ‘Uniqueness_NR50’ are both 0, regardless of their coreness scores. I can understand either system on its own, but I don’t know how they relate to each other.
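To make this reading concrete, here is a minimal Python sketch of how I understand the unique/quasi split sitting alongside the tiers. The `Marker` class, the marker names, the coreness values, and the hand-assigned tier labels are all my own assumptions for illustration; none of this is taken from the paper or from the MetaPhlAn code, and I do not know the real tier thresholds.

```python
from dataclasses import dataclass


@dataclass
class Marker:
    name: str
    tier: str              # "A", "B", "C" or "U"; assigned by hand here, not computed
    coreness: float        # made-up value, deliberately ignored by the unique/quasi split
    uniqueness_nr90: int
    uniqueness_nr50: int


def unique_or_quasi(m: Marker) -> str:
    # Paragraph 3 as I read it: only the two uniqueness values matter.
    if m.uniqueness_nr90 == 0 and m.uniqueness_nr50 == 0:
        return "unique"
    return "quasi"


# Hypothetical markers, one per case I am asking about.
markers = [
    Marker("m1", "U", 0.60, 0, 0),
    Marker("m2", "A", 0.95, 0, 0),
    Marker("m3", "B", 0.80, 2, 5),
]

labels = {m.name: (m.tier, unique_or_quasi(m)) for m in markers}
print(labels)
```

Under this reading, `m2` would be both a tier ‘A’ marker and a ‘unique marker’, which is the overlap I am unsure about.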

My other question concerns the iteration process: as I understand it, you only move on to the thresholds of the next tier if you cannot select more than 50 markers using the current tier's thresholds. The ordering of the tiers is therefore crucial. I understand that A is more stringent than B, which in turn is more stringent than C, but I believe ‘U’ is the best of all. Shouldn’t we start with tier ‘U’? If so, I find it a bit misleading that the description places tier ‘U’ after ‘C’. Or am I missing the importance of coreness here? Please help.
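Here is a small Python sketch of the fallback logic as I understand it, to show why the tier order matters to me. The function `select_markers`, the toy candidates, and the assumption that each tier is applied afresh to the full candidate list are mine; only the cutoff of 50 markers comes from my reading of the paper.

```python
MIN_MARKERS = 50  # my reading: move to the next tier unless more than 50 markers are selected


def select_markers(candidates, tier_order, passes_tier, min_markers=MIN_MARKERS):
    """Try each tier in order and stop at the first one yielding more than min_markers."""
    used_tier, selected = None, []
    for tier in tier_order:
        used_tier = tier
        # Assumption: each tier's thresholds are applied to all candidates from scratch.
        selected = [c for c in candidates if passes_tier(c, tier)]
        if len(selected) > min_markers:
            break
    return used_tier, selected


# Toy data: each candidate is simply the set of tiers whose thresholds it would pass.
candidates = [{"A", "U"}] * 55 + [{"A"}] * 15


def passes(candidate, tier):
    return tier in candidate


tier_a_first = select_markers(candidates, ["A", "B", "C", "U"], passes)
tier_u_first = select_markers(candidates, ["U", "A", "B", "C"], passes)
print(tier_a_first[0], len(tier_a_first[1]))
print(tier_u_first[0], len(tier_u_first[1]))
```

With these made-up candidates, starting from ‘A’ stops at tier ‘A’ with 70 markers, while starting from ‘U’ stops at tier ‘U’ with 55, so the two orderings give different marker sets.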

Sorry this is long; I hope I have explained myself clearly. Please let me know if I can clarify anything.


