Sorry for the late answer; I was away from work this past week.
First, thanks a lot for these detailed and relevant questions. I’ve actually never seen such a way of visualizing PanPhlAn intermediate results; it is very interesting.
So let’s talk about the thresholds and their role first
Actually, PanPhlAn provides a way of visualizing mapping results as coverage curves. This is an example from the PanPhlAn tutorial (not normalized).
I find this kind of visualization more useful for understanding what’s happening sample-wise, while your plots are great for a full view of all samples. PanPhlAn profiling aims to define three regions in these curves: the plateau in the middle, which is the most important; the region to the right of the plateau, containing the genes that are almost certainly absent; and the region to the left of the plateau, containing the housekeeping genes.
It is easy to conclude that the genes whose coverage falls on the plateau belong to the genome of the dominant strain in the sample being profiled. Likewise, the very-low-coverage genes falling to the right of this plateau are easily considered absent.
However, the genes on the left (with very high coverage) are housekeeping genes, which belong to the species of interest but also to other species from the same genus, or even the same order… It is thus very difficult to say that these genes are present in the specific strain of the specific species we are profiling, even though we are sure they are present in the metagenomic sample in the first place.
The provided thresholds, the left one and right_cov, are the expected values on the left and right sides of the plateau. These values are checked at the positions defined as 30% and 70% of the genome length.
Basically, PanPhlAn expects at least 40% of all the genes’ individual coverage values to fall between the left and right_cov values, with a median value of at least min_coverage (the third threshold), to consider the strain detectable and present in the sample. Otherwise the sample is discarded.
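To make the check described above more concrete, here is a minimal Python sketch. It is not PanPhlAn’s actual code: the function name strain_detected is hypothetical, and I am assuming that the coverage curve is normalized by its median and that the parameters left_max and right_min correspond to the left and right plateau thresholds. The default values in the sketch are the sensitive ones I mention in this answer, not PanPhlAn’s defaults.

```python
from statistics import median


def strain_detected(gene_coverages, min_coverage=1.0, left_max=1.70, right_min=0.30):
    """Illustrative plateau check (hypothetical helper, not PanPhlAn's code).

    gene_coverages: per-gene coverage values for one sample.
    Assumption: the curve is normalized by its median before the
    left/right thresholds are applied.
    """
    # Coverage curve: genes sorted from highest to lowest coverage.
    curve = sorted(gene_coverages, reverse=True)
    n = len(curve)
    if n == 0:
        return False

    med = median(curve)
    # The strain must be covered deeply enough overall.
    if med <= 0 or med < min_coverage:
        return False

    # Normalized coverage at 30% and 70% of the curve length
    # (integer arithmetic avoids floating-point index surprises).
    left_value = curve[(30 * n) // 100] / med
    right_value = curve[(70 * n) // 100] / med

    # The middle 40% of the curve must stay within the plateau band.
    return left_value <= left_max and right_value >= right_min


# Toy sample: 10 housekeeping genes, 80 plateau genes, 10 near-absent genes.
plateau_sample = [50.0] * 10 + [10.0] * 80 + [0.1] * 10
# Same shape, but sequenced too shallowly.
shallow_sample = [c * 0.05 for c in plateau_sample]
# A steadily decreasing curve with no real plateau.
steep_sample = [float(c) for c in range(1, 101)]

print(strain_detected(plateau_sample))
print(strain_detected(shallow_sample))
print(strain_detected(steep_sample, left_max=1.25, right_min=0.75))
```

The last call uses stricter values chosen only for illustration. It shows how a curve without a clear plateau gets rejected once the band is narrowed, while the same curve passes with the sensitive values.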
At the bottom of this page you’ll find the default values of these thresholds. I personally use the very sensitive setup --min_coverage 1 --left_max 1.70 --right_min 0.30 for my usual analysis.
Using such thresholds with your examples above should detect F. prausnitzii more efficiently. For R. gnavus, the coverage is quite high for a significant portion of the samples. I would be quite curious to also check for the presence of other Ruminococcus in these samples (R. bromii, R. torques, which are also quite abundant). Basically, part of the R. gnavus pangenome overlaps with the pangenomes of these other Ruminococcus. So profiling the strain of the exact species is quite hard, as it is extremely difficult to assess whether a given gene is present in the strain of one species when this very gene belongs to several species.
And that basically leads us to your second question. MetaPhlAn analyses are made using a set of markers that are core to one species while also being specific to it (genes found in R. gnavus all the time and only in R. gnavus). The difference between the two tools is not always obvious. MetaPhlAn uses a set of core, species-specific genes to assess the abundance of the targeted species, while PanPhlAn uses the whole pangenome of that species to assess the gene composition of the dominant strain in the sample. PanPhlAn should not be used to assess the presence of a bug; MetaPhlAn should. To illustrate this, let’s focus on your R. gnavus plot. I think that by using the full red square you are not filtering enough, and you are basically picking up signal contributed by the presence of R. bromii, R. torques or others. Consequently, the number of samples in which you found R. gnavus present is way too high.
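The distinction between core species-specific markers and a pangenome can be sketched with plain Python sets. This is only a toy illustration with made-up gene names and hypothetical helper functions; neither tool is implemented this way.

```python
def core_specific_markers(target_strains, other_species_genes):
    """Genes present in every strain of the target species and in no other species."""
    core = set.intersection(*target_strains)
    return core - other_species_genes


def pangenome(target_strains):
    """All genes seen in at least one strain of the target species."""
    return set.union(*target_strains)


# Toy data (hypothetical gene names): two strains of the target species...
target_strains = [
    {"g1", "g2", "hk1", "acc1"},
    {"g1", "g2", "hk1", "acc2"},
]
# ...and the genes found in closely related species.
other_species_genes = {"hk1", "acc2", "t1"}

markers = core_specific_markers(target_strains, other_species_genes)
pan = pangenome(target_strains)
shared = pan & other_species_genes

print(sorted(markers))  # core and species-specific: usable to assess presence
print(sorted(pan))      # whole pangenome: used to describe strain gene content
print(sorted(shared))   # ambiguous genes also carried by related species
```

In this toy case the shared genes are the ones that cannot tell you which species a read came from, which is why counting the whole pangenome as evidence of presence overestimates the number of positive samples.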
To conclude, for your later analyses, I first advise you to use the sensitive option of PanPhlAn: --min_coverage 1 --left_max 1.70 --right_min 0.30. You’ll get more results than with the default one while still doing some relevant filtering of the housekeeping genes. By doing so, you might get fewer samples with PanPhlAn than with MetaPhlAn. This would make sense, as it is easier to measure the abundance of one species than to assess the composition of a strain. You can, of course, play a bit with these thresholds, but keep in mind that the more sensitive you are, the less relevant and reliable the PanPhlAn output will be.
Sorry if this answer is a bit long; I hope it still answers your questions. Feel free to ask if it doesn’t.