Missing sample from the panphlan profile step

Dear authors,

I find that both PanPhlAn and PanPhlAn 3 may miss some samples in the panphlan profile step.

PanPhlAn sometimes reports messages like:
[TERMINATING…] /usr/local/bin/panphlan_profile.py, 0.23 minutes.
QUALITY WARNING: sample GCF_000598405_errfree_r0 may contain multiple strains, PanPhlAn extracts the dominant strain
QUALITY WARNING: gene-families of sample GCF_000598405_errfree_r0 may come from multiple strains
number of gene-families: 6739 is 10% higher than expected number (average of ref. genomes): 4527

However, in PanPhlAn3, no warning messages were reported at all.

Moreover, I could not really understand the theoretical logic behind this message. Could you please explain a little about how I should interpret it?

Thank you!

Yuxiang Tan

Hi,

These messages are warnings from the multi-strain detection in PanPhlAn. In PanPhlAn 3 you might get a message that looks like this: G88797_map.tsv WARNING: sample may contain multiple strains

This comes from the plateau detection step. The coverage of all the genes in the pangenome is sorted and normalized, and can be visualized as a coverage curve looking like this:

Here you have an example with 3 samples (blue, red and yellow curves). PanPhlAn aims to detect 3 parts in the curve.

  1. The high-coverage genes, usually shared across many species (housekeeping or multicopy genes). They are detected as present in the metagenome, but you can never tell for sure whether they come from your targeted species, as they can belong to many different species.

  2. The actual plateau, highlighting the composition of the dominant strain in the sample.

  3. The fall of the curve: these are genes present in the pangenome that are not detected in the sample.
    However, in some cases (the yellow curve, for example), you might see a second, smaller plateau at a lower coverage. This usually indicates that another strain exists in the sample, with genes not present in the dominant strain. While it is very difficult to disentangle that kind of signal, it is interesting to know that, for a targeted species and in that particular sample, 2 strains with different gene composition might coexist.
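To make the curve more concrete, here is a minimal Python sketch of the sorting and normalization described above. It is not PanPhlAn's actual code: the function name `coverage_curve` and the choice to take the median over genes with non-zero coverage are assumptions made for illustration, and the coverage values are toy numbers.

```python
import numpy as np

def coverage_curve(gene_coverage):
    """Sort per-gene coverages from high to low and divide by the median.

    Assumption (not taken from PanPhlAn): the median is computed over
    genes with non-zero coverage, so that the plateau sits near 1.
    """
    cov = np.sort(np.asarray(gene_coverage, dtype=float))[::-1]
    covered = cov[cov > 0]
    if covered.size == 0:
        raise ValueError("no gene has non-zero coverage")
    return cov / np.median(covered)

# Toy coverages for 7 pangenome genes (made-up numbers).
curve = coverage_curve([10, 0, 40, 9, 12, 10, 0])
print(curve)
```

In this toy curve the first value stands far above 1 (part 1), the values close to 1 form the plateau (part 2), and the trailing zeros are pangenome genes not detected in the sample (part 3).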

I hope this answers your question, let me know if something is still unclear.
Btw, PanPhlAn is a bit outdated, just use PanPhlAn 3

Have a nice day,

Léonard

Hi, Léonard.

Thank you so much for your detailed explanation. It is very helpful.

However, I am still confused and have the following two questions:

  1. I ran 20 samples and only 8 samples had results, but I got no warning message at all.
    That’s my main confusion.

  2. Why should a sample with multiple strains return no result? In my opinion, “dominant strain” admits that more than one strain can coexist. If two or more strains have very similar abundance, should they together be considered the dominant subspecies, rather than reporting nothing? Or are there other concerns about doing so?

Thank you!

Yuxiang

If you add the -v or --verbose parameter, you will get detailed information about all your samples, especially the threshold values.

Here, a sample can be discarded if it does not pass the left threshold (the maximum expected value at the line between parts 1 and 2 on the figure above) or the right threshold (the minimum expected value at the line between parts 2 and 3). The multi-strain detection does not discard the sample by itself; it only prints a warning for you to know.

If you feel like too many samples are discarded with the default thresholds, feel free to tune them as in here (bottom of the page)
I usually use the sensitive setup --min_coverage 1 --left_max 1.70 --right_min 0.30

OK, I get it. Thank you!

One more question. What’s the drawback or trade-off of using the sensitive setup? More false positive gene families? Or something else?
Also, what does it mean if left_max is high? That there are too few housekeeping/multicopy genes shared across many species? Or is there another biological hypothesis?

Best,

Yuxiang

Hi, Léonard:

What’s the drawback or trade off of using the sensitive setup? More false positive gene families? Or something else?
I am still not too sure how to understand the biological meaning behind these settings.

Best,

Yuxiang

Hi, sorry for the late reply

PanPhlAn actually consists of two steps, or questions:

  1. Is a targeted species present in a sample?
  2. If yes, what’s its gene composition?

A more sensitive setup will include more strains in the profile matrix (step 1). That means that whether or not your targeted species is considered present in a sample depends on these thresholds. Of course, relaxing them a bit makes species detection more sensitive, which can be useful in case of low abundance or low sequencing depth. In the more extreme cases (way too sensitive thresholds), a gene presence-absence profile (step 2) would be generated even when the species is not present in your sample. However, this can actually be seen in the presence-absence matrix, as such a detected strain would have a very low or very high number of genes compared to the average gene number of the reference strains.
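As a rough illustration of that last check, the sketch below compares the number of gene families detected in each sample to the average of the reference genomes. The function name `flag_unusual_gene_counts` and the 10% tolerance are assumptions for illustration only (the tolerance simply echoes the "10% higher than expected" warning quoted in the first post), and `sample_ok` and `sample_low` are made-up samples. The reference counts are invented so that their average matches the 4527 from that warning.

```python
def flag_unusual_gene_counts(sample_gene_counts, ref_gene_counts, tolerance=0.10):
    """Return samples whose gene-family count deviates from the reference average.

    sample_gene_counts: dict mapping sample name -> number of gene families present
    ref_gene_counts: list with the number of gene families of each reference genome
    tolerance: allowed relative deviation (hypothetical value, not a PanPhlAn default)
    """
    expected = sum(ref_gene_counts) / len(ref_gene_counts)
    flagged = {}
    for sample, n_genes in sample_gene_counts.items():
        ratio = n_genes / expected
        if abs(ratio - 1.0) > tolerance:
            flagged[sample] = round(ratio, 2)
    return flagged

# Invented reference counts averaging 4527; the first sample is the one
# from the warning above, the other two are hypothetical.
refs = [4500, 4554]
samples = {
    "GCF_000598405_errfree_r0": 6739,
    "sample_ok": 4600,
    "sample_low": 900,
}
print(flag_unusual_gene_counts(samples, refs))
```

Samples flagged this way are worth a second look before trusting their presence-absence profile, especially when they only passed detection under relaxed thresholds.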

Tell me if it’s still unclear

Hi, Léonard:

Thank you for the reply, but I am still confused.

As I understand it now, in the sensitive mode (compared to the default), left_max moves the first line to the left in the figure and right_min moves the second line to the right. If the margin/turning point of the plateau is outside this window, the sample is considered not to contain the target species and is filtered out. Otherwise, the sample is kept. Am I correct?

If I am correct, then for a sample that is kept in the sensitive mode but filtered in the default mode, part 1 in the figure should be narrower than for a regular sample (one that is kept in the default mode). So what is the biological meaning of this scenario? Is the number of housekeeping and multicopy genes low in the sample? Similarly, what is the meaning of part 3 in the figure?

Or, if I am wrong, how should I understand these two parameters?

Thank you!

Yuxiang

It is easier to see the problem with this figure:

To answer the first question (is the species present?), PanPhlAn searches for a plateau in the curve.

The search range for the plateau on the x axis is fixed (if I remember correctly, it’s between 0.3 and 0.7 times the average number of genes of the reference genomes).
The normalization sets the median coverage value to 1.

The left_max and right_min parameters are the upper and lower values on the y axis (normalized coverage) that you expect in the most extreme cases.
With the sensitive setup, that means that if no more than 30% of the genes have a coverage above 1.7 times the median AND no more than 30% of the genes have a coverage below 0.3 times the median, then the sample passes detection and the gene presence-absence is computed.

If a sample does not pass this detection, meaning the curve goes either above point A or below point B, then the sample is discarded from the analysis.
Being way too sensitive (for example --left_max 5 --right_min 0.1) would let samples where the species is absent (or at very low abundance and almost not covered by your sequencing data) pass the detection. In such cases, the presence-absence computed from them would be completely irrelevant.
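The rule above can be sketched in a few lines of Python. This is only an illustration of the explanation in this thread, not PanPhlAn's implementation: the function `passes_plateau_check` is hypothetical, the 0.3 and 0.7 positions are the ones recalled above "if I remember correctly", the strict values 1.25 and 0.75 are picked purely for contrast, and both curves are toy data already sorted and normalized.

```python
import numpy as np

def passes_plateau_check(curve, n_ref_genes, left_max, right_min,
                         left_pos=0.3, right_pos=0.7):
    """Check a sorted, median-normalized coverage curve against both thresholds.

    curve: normalized coverages sorted from high to low
    n_ref_genes: average number of genes in the reference genomes
    The curve must be <= left_max at point A (left_pos * n_ref_genes)
    and >= right_min at point B (right_pos * n_ref_genes).
    """
    curve = np.asarray(curve, dtype=float)
    a_idx = int(round(left_pos * n_ref_genes))
    b_idx = int(round(right_pos * n_ref_genes))
    if b_idx >= len(curve):
        raise ValueError("curve is shorter than the plateau search range")
    return bool(curve[a_idx] <= left_max and curve[b_idx] >= right_min)

# Toy curves over a 14-gene pangenome, with 10 genes per reference genome.
borderline = [6.0, 3.0, 1.5, 1.4, 1.0, 1.0, 1.0, 0.5, 0.4, 0.2, 0, 0, 0, 0]
barely_covered = [9.0, 4.0, 2.0, 1.0, 1.0, 0.5, 0.3, 0.15, 0.1, 0, 0, 0, 0, 0]

print(passes_plateau_check(borderline, 10, left_max=1.25, right_min=0.75))     # False
print(passes_plateau_check(borderline, 10, left_max=1.70, right_min=0.30))     # True
print(passes_plateau_check(barely_covered, 10, left_max=1.70, right_min=0.30)) # False
print(passes_plateau_check(barely_covered, 10, left_max=5, right_min=0.1))     # True
```

The borderline sample is rescued by the sensitive thresholds, while the barely covered sample only passes with the overly permissive ones, which is the situation where the resulting presence-absence profile should not be trusted.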

If a sample passes this step, all parts (1, 2, 3 on the very first figure) are considered. The assessment of gene presence or absence is then a separate part of the PanPhlAn analysis.

I don’t know which samples/species you’re profiling so I’m not sure I can explain it better without concrete examples…

Hi, Léonard:

Thank you so much! This is very clear now; I had misunderstood it before.

This is very helpful!

Best,

Yuxiang