PanPhlAn threshold selection

Dear Developers,

I am working on PanPhlAn recently. And I have a few questions about the choice of thresholds, especially about “left_cov” and “right_cov”. I attached my current procedures below, along with questions. Thank you in advance for your patience.

(1) It seems to be necessary to select the threshold in a species depend manner. Below are a few examples. They are scatter plots of left_coverage and right_coverage (intermediate output of PanPhlAn, normalised by median coverage.)

The purple box is the default cutoff. The red box is my thresholds, manually selected after seeing this scatter plot. The example of F. Prausnitzii indicated that it is necessary to set the parameters by hand. Because the default parameters will not identify F. Prausnitzii in the majority of the sample, which is contradict against the fact that F. prausnitzii is one of the most common spp in healthy gut. And It is not alignment with Metaphlan3’s profile, which predicted the present of F prausnitzii in most of the samples.

(2) After manually choosing the parameters, I compare the results from PanPhlAn with MetaPhlAn. I would like to confirm that species identified by PanPhlAn is also identified by Metaphlan. But this is not always the case. For example, R. gnavus, I selected the parameters as the red box in above figure, R. gnavus is identified in 294 samples by PanPhlAn but only in 87 samples by Metaphlan. The overlapped samples between two methods are 78. image The number of gene family counts is shown in this graph.

I am worried that I am not doing correctly with the panphlan. Do you have any suggestions on my current choice of the thresholds? And why the results are not agreed between Metaphlan and Panphlan?

Looking forward to hearing from you!
Best wishes,
AuAs

Hello,

sorry for the late answer, I was away from work the past week.

First, thanks a lot for these detailed and relevant questions. I’ve actually never seen such way of visualizing PanPhlAn intermediate results, it is very interesting.

So let’s talk about the thresholds and their role first

Actually PanPhlAn provides a way a visualizing mapping results as coverage curves. This is an example coming from PanPhlan tutorial (not normalized)
image

I find this kind of visualization more useful to understand what’s happening sample-wise. Your plots are great for a full visualization of all samples. PanPhlAn profiling aims to define 3 regions in this curves, the one on the middle, the plateau, which is the most important. The one on the right of the plateau, the quasi-non present genes, and the one on the left of the plateau, the housekeeping genes.

It is very easy to assess that the genes whose coverage is part of the plateau are part of the genome of the dominant strain of the sample being profiled. Also, the very low coverage gene falling at the right of this plateau are easily considered non-present.
However, the genes on the left (with a very high coverage) are housekeeping gene, belonging to the species of interest but also other species, from the same genera, or even order… It is thus very difficult to assess that these genes are present in the very strain of the very species we are profiling even if we are sure that these genes are present in the metagenomics sample in the first place.

The provided threshold left_cov and right_cov are the expected values on the left and right side of the plateau. These values are checked at the positions defined as 30% and 70% of the genome length.

Basically, PanPhlAn expect at least 40% of all the genes indivudal coverage values to fall in between left_cov and right_cov values while having a median value of at least min_coverage (the 3rd threshold) to consider the strain detectable and present in the sample. Otherwise the sample is discarded.

On the bottom of this page you’ll find the default values of these thresholds. I personally use the very sensitive setup --min_coverage 1 --left_max 1.70 --right_min 0.30 for my usual analysis.

Using such thresholds with your examples above should detect F. prausnitzii more efficiently. For R. gnavus the coverage is quite high for a significant portion of the samples. I would be quite curious to check also the presence of other Ruminococcus in these samples (R. bromii, R. torques, that are also quite abundant). Basically, part of R. gnavus pangenome is overlapping with pangenomes of these other Ruminococcus. So, profiling the strain of the exact species is quite hard as it is extremely difficult to assess if a given gene is present or not in the strain of one species, when this very gene belong to several species.

And that basically leads us to your second question. MetaPhlAn analysis are made using a set of markers that are core of one species while being specific at the same time (gene found in R. gnavus all the time and only in R. gnavus). Here, the difference between the two softwares is not always obvious. MetaPhlAn uses a set of core species-specific gene to assess the abundance of the targeted species while PanPhlAn uses the whole pangenome of that species to assess the composition of the dominant strain in the sample. PanPhlAn should not be used to assess the presence of a bug, MetaPhlAn should. To illustrate this, let’s focus on your R. gnavus plot. I think that using the full red square, you are not filtering enough information and basically taking information conveyed by the presence of R. bromii, R. torques or other. Consequently, the number of sample in which you found R. gnavus present is way too high.

To conclude, for the later analysis, I first advise you to use the sensitive option of PanPhlAn --min_coverage 1 --left_max 1.70 --right_min 0.30. You’ll get more results than with the default one while doing some relevant filtering of the housekeeping genes. By doing so, you might get less samples with PanPhlAn compared to MetaPhlAn. This would make sense, as it is easier to measure the abundance of one species than to assess the composition of a strain. You can of course, play a bit with these thresholds, but keep in mind that the more sensitive you are, the less relevant and reliable the PanPhlAn output will be.

Sorry if this answer is a bit long, I hope it will still answer your questions. Feel free to ask if it’s not the case.
Léonard

Hi Léonard,

Thank you very much for the detailed answer! It’s very helpful. I still have some additional questions to ask you for advice.

(1) Would you suggest me to remove those strains identified by Panphlan but not by Metaphlan for further analysis, such as comparing strains from different samples?

(2) I am worried that the number of predicted gene families much larger than the average Reference average. Do you have any suggestions about how to handle this issue? And how to handle the multiple-strain in the same sample warning?

Looking forward to hearing from you.
Best wishes,
AuAs

Hello,

glad the answer helped !

(1) Would you suggest me to remove those strains identified by Panphlan but not by Metaphlan for further analysis, such as comparing strains from different samples?

I would say that it depends on the biological question and the confidence level that you want for the detected strains. But to make it simple, I would eventually discard samples (if any) for which PanPhlAn detects a strain but where MetaPhlAn does not detect the species at all ( = relative abundance for that sample is 0). Otherwise, If MetaPhlAn detects it, even at low abundance, let’s keep it in PanPhlAn, if PanPhlAn detects it as well.

(2) I am worried that the number of predicted gene families much larger than the average Reference average. Do you have any suggestions about how to handle this issue? And how to handle the multiple-strain in the same sample warning?

It is quite strange if the average genoem size of reference genome and genome sizes of strains are very different. Which is the species of interest in that case ? Anyway, it might be due to a lot of different things… I would not focus too much on the problem if it’s not messing too much with the overall analysis (like if it discard almost all samples because they do not pass the thresholds anymore) .

The multiple strain warning just tells you that the sample might contain co-present strains of the same species. It is detected if the coverage value at the rightmost part of the curve is still quite high. However, there usually is one dominant strains that is more frequent. PanPhlAn will focus on the gene composition of that one and not the other. If multi-strain detection is one the goal of your study, then you can filter out these samples and analyze them with a software specialized in multi-strains detection.

Best regards
Léonard

Thank you for the quick response! I was looking at many species at the same time. Actually, I have seen the cases where detected number of gene families much larger than the reference average. I attached a few more examples. These figures are distribution of the number of gene families for all the samples where the species is identified. Red line is the average for the reference genomes. The name of the species is specified in the figure’s title.


And here is the coverage plot for them.

What would you suggest me to do in this case?

Best wishes,
AuAs

Hello again !

if you computed the number of gene families per sample as the column sums of 1s in the final presence-absence matrix, this kind of results can actually make sense.

The final binary matrix gives you an information as 0 = the gene is absent in the sample (= its coverage was too low) and 1 = the gene is present in the sample (= its coverage was high enough) AND, in both cases, the gene family was part of the defined pangenome.

This means that a gene family can be detected present in a sample even if its coverage is very high (like a core/multicopy gene, eventually also present in other species). But yet, it is part of the species pangenome, detected in the sample, so, one cannot really assess that it is not present. Knowing this, it make sense that the detected strains tends to have more gene families.

To conclude, I would say your results are fine and look like what’s expected (the normalized coverage curves makes sense). Moreover, depending on the kind of biological question you are trying to answer using PanPhlAn, you might want to be more or less stringent with the sample filtering.

Best regards,
Léonard

Thank you for the suggestions!

Best wishes,
AuAs