The bioBakery help forum

Minimum depth for PanPhlAn


What would you advise as a minimum depth per sample for PanPhlAn? Presumably for a two group comparison, I could pool species-specific reads from each group if necessary to reach a required depth.

This would be for Abiotrophia defectiva, for which a pangenome does not exist, so it may be that I end up with only two “genomes” generated from my sample groups, and then generate a pangenome based on the tool you release this year.



Hi Andrew!

I’m not sure I understood your question here…

If it’s related to the sequencing depth required for PanPhlAn, don’t worry, the script uses a set of thresholds that will discard samples having a strange coverage curves, too many genes with low coverage or overall low coverage pangenome…

If it is about creating some custom pangenome, there’s this tool PanPhlAn_pangenome_exporter. A bit more tricky to use and time consuming than PanPhlAn but quite powerful. You could for example provide MAGs from your samples (if any available) to build a pangenome.

I don’t know if this answers you question, feel free to ask if it’s still unclear :sweat_smile:

1 Like

Quite appropriately you’ve hit detected two parts to the question! I have a species which is covered to an average of 40,000 reads across 221 samples. They are unevenly distributed, so I actually have 18 MAGs, each from distinct samples, of >50% completeness.

Before I realised how many MAGs I’d generated already, I was considering pooling genus specific reads (extracting based on kraken IDs) and doing a group-wise co-assembly. However, quick experiments show very poor assemblies, probably due to much strain-level variation (I may see if STRONG can help here, if I have the time and capacity).

I have successfully managed to generate pangenomes with the exporter tool (thanks!) and am now awaiting all the panphlan results.

When undertaking statistical comparisons, do you find you need to take depth into account – clearly gene detection will be better in some samples (up to 1 million reads) than others (few thousand). Is the simple censoring of low coverage pangenomes enough that I can be confident presence/absence in sufficiently covered samples is correct?

Best wishes,



Seems like a very interesting analysis, I actually never tried using MAGs for custom pangenome generation…yet. Definitely something I want to try in the future :smiley:

From my own experience, running PanPhlAn with sensitive option --min_coverage 1 --left_max 1.70 --right_min 0.30 usually filters the sample in a satisfying way. However, I work with large scale analysis, usually with more than a thousand samples and don’t mind discarding too much of them.

I would also advise to use these parameters: --th_non_present 0.25 --th_present 0.5 instead of the default ones. It will be more stringent/robust on the single gene presence/absence assessment.

With these parameters, profile everything and look if depth could be a confounding factor. I would advise you to check

  • the average/distribution of the number of genes families present in strains
  • the pairwise Jaccard distance between groups (intra group/inter groups comparison)

However I guess in your case you should also find a way to disentangle that effect from the variability that can happen between your groups of interest. Depends on the openness of your species pangenome, the size of its core genome compared to the accessory one…

Feel free to ask if some things are still not clear


1 Like

Thanks @leonard.dubois - really helpful. I note these have now become the default thresholds (if this could appear in the command help text, it would be useful). Am now analysing my second species - wonderful tool!

Looking across my 130 samples for which PanPhlAn could produce output (of 219), I see that the number of genes detected has a median 1541, IQR 1399-1666, and range 1001-1829. Because only two whole genomes exist, it will be difficult to define core/accessory genome.

I think it might make sense to look for SCGs and tRNAs – I’ll see if I can track down these gene names for my organism of interest from checkm output.

130 out of 219 is not so bad at all!

You can also define core/accessory based on the “empirical prevalence” (from the PanPhlAn presence absence matrix). Thus, defining the pangenome parts (core, soft/extended core, shell, cloud) as if you had 130 genomes.
Usually with Heatmap visualization, these parts are rather easy to spot

Fair point! I’ll just have to accept that prevalence will likely not hit 100% due to stochastic effects. I’ll pause my efforts to cross-reference Pfam/TIGR names from the Lactobacillales SCGs in checkm with the UniRef protein namaes!

It seems there is a pretty reasonable cut-off around 75-80, which corresponds to ~60% prevalence. There are just over 1500 core genes by these definitions.

Does that seem sensible? I can then potentially use the number of core genes present to adjust the presence/absence of other genes. I wonder if more nuanced statistical correction would be allowed by using the underlying read count data.

Thanks again for your time and advice.

There is no golden fixed standard for that as far as I know.
I usually use the thresholds from this paper Methods section:

Classification thresholds for pangenome subcomponents were defined as follows: core genes–present in all strains; extended core—present in >90% of genomes; cloud genes—present in <15% (includes unique genes in pangenome); the remaining part of pangenome was considered “shell” genes (Supplementary Fig. 1). These thresholds are based on default parameters of the Roary pipeline

In other PanPhlAn analysis for simpler classification, I use 95% prevalence for core, and 5% for unique/rare gene with everything in between as shell.

The problem with read count is that in the core genome you will find:

  • the core-specific genome with read count correlating with relative abundance of the species (basically MetaPhlAn markers genes)
  • the “shared” core genome, housekeeping genes shared with other species that might have a huge read count
1 Like