I am trying to understand MetaPhlAn’s rules for including a clade/taxon in the final relative abundance composition. I calculated the percentage of nonzero markers for each clade (called PC_hit_markers here, calculated from the marker_counts output). The first thing I noticed was that some clades were not included although they had higher PC_hit_markers values than some of the included clades. This was due to the disambiguation procedure, as these clades were included when I used --avoid_disqm.
The second thing I noticed was that many clades were included with PC_hit_markers values below 33%, although the default --perc_nonzero value is 0.33. This became far more pronounced when I set --stat_q to 0.01 (leaving --perc_nonzero untouched). With these settings, several clades for which only 1 marker gene was hit made it into the composition.
I would therefore like to ask:
What is the ordered set of rules to decide whether a clade is included (based on --stat_q and --perc_nonzero and potentially other arguments I missed)?
And a related question: what is the easiest way to document the number of reads (or read pairs) that were used for the calculation of the final compositions?
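For readers who want to reproduce the PC_hit_markers value described above, here is a minimal sketch in Python. It assumes you already have two plain dictionaries: one with per-marker counts (as parsed from the marker_counts output) and one mapping each marker to its clade. The function name, the dictionary layout, and the toy marker and clade names are hypothetical and are not part of MetaPhlAn; markers missing from the counts are treated as zero.

```python
from collections import defaultdict


def pc_hit_markers(marker_counts, marker_to_clade):
    """Return {clade: percentage of its markers with a nonzero count}.

    marker_counts   -- dict marker -> read count (missing markers count as 0)
    marker_to_clade -- dict marker -> clade name (hypothetical mapping,
                       assumed to have been extracted from the marker database)
    """
    total = defaultdict(int)  # number of markers defined for each clade
    hit = defaultdict(int)    # number of those markers with count > 0
    for marker, clade in marker_to_clade.items():
        total[clade] += 1
        if marker_counts.get(marker, 0) > 0:
            hit[clade] += 1
    return {clade: 100.0 * hit[clade] / total[clade] for clade in total}


# Toy data, for illustration only.
marker_to_clade = {
    "m1": "s__A", "m2": "s__A", "m3": "s__A",
    "m4": "s__B", "m5": "s__B",
}
marker_counts = {"m1": 12, "m3": 0, "m4": 3}

result = pc_hit_markers(marker_counts, marker_to_clade)
for clade, pct in sorted(result.items()):
    print(f"{clade}\t{pct:.1f}")
```

In the toy data, s__A has one of three markers hit and s__B one of two, so s__A sits right at the one-third mark that the thread is asking about.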
Thank you for posting this question. I am dealing with the same issue and have not been able to understand what --perc_nonzero is actually doing. Initially I thought it was the percentage of markers that must be present to calculate averages for a clade, but that does not seem to be the case.
I ran some iterations and it looks like --perc_nonzero applies to the quasi-markers (the ones with a lower uniqueness score, as far as I understand), also because you can completely override it with the --avoid_disqm flag.
Thanks Ayse! This is interesting. I have had no time to look further into this (and will probably only come back to it in February). It would be great if there were an easy-to-find set of rules or simply a clearer description of the arguments in the help file. For example, the help on --perc_nonzero reads: “Percentage of markers with a non zero relative abundance for misidentify a species [default 0.33]”. I guessed this would mean “proportion of markers that need to have a non-zero relative abundance to include a species (clade/taxon) in the composition” (for example, if a species has 100 markers and I set it to 0.33, then at least 33 markers must be non-zero). But it doesn’t seem to be that simple. Thanks for any kind of input on the actual ruleset (and Happy New Year :)).
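To make the guessed reading of --perc_nonzero concrete, it can be written as a small check. This is only a sketch of the interpretation described above, not MetaPhlAn's actual rule (the observations in this thread suggest the real behaviour differs); the function name and its arguments are hypothetical.

```python
def passes_naive_threshold(n_hit, n_total, perc_nonzero=0.33):
    """Literal reading of the help text: keep a clade only if the fraction
    of its markers with nonzero abundance is at least perc_nonzero.

    This mirrors the guess made in the thread; it is NOT taken from the
    MetaPhlAn source code.
    """
    if n_total <= 0:
        return False
    return n_hit / n_total >= perc_nonzero


# The example from the post: a species with 100 markers and a 0.33 threshold.
print(passes_naive_threshold(33, 100))  # 33 of 100 markers hit
print(passes_naive_threshold(32, 100))  # 32 of 100 markers hit
print(passes_naive_threshold(1, 100))   # a single marker hit
```

Under this reading a clade with a single hit marker out of 100 would always be rejected, which is exactly what the reported results contradict.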