Am I doing it right...?


I’m just getting started with PanPhlAn for analysis of dominant microbes in throat swab shotgun metagenomic data. I have run PanPhlAn against species which have the most reads according to kraken2 (e.g. Rothia mucilaginosa mean 3.2 million read pairs per sample). I used the original Illumina fastq files (first pair, 150 bp), without filtering out human reads, or reads from other organisms.

I get a heatmap looking like this:


I can see that all Rothia in our samples cluster with one set of reference genomes.

I just want to check: (1) Does this look like the sort of results one might expect? (2) Am I better to filter out human and even other organism reads before running PanPhlAn, and (3) How best to handle paired end reads?

NB - insert size median ~250 bp, so many reads overlap. Perhaps best to merge?

Thanks for your advice,


Hello Andrew,

thanks for your interest for PanPhlAn.

  1. Your heatmap looks like the kind of results one might expect. Depending on the species studied, or the number of samples provided as well as the parameters of the profiling step, the final results might differ. But in general, it looks like that.

  2. Filter out human reads might speed up the mapping step. However, I do not advise you to filter other organisms. PanPhlAn analyses a coverage curve of all gene families in a pangenome and classify some as “multi-copy gene families”. These are gene families with a tremendous amount of copies in a metagenomic sample since there are housekeeping genes shared between multiple bacterial species. It is important to define this group in order to define gene families that do not belong to it and are specific genes of the species of interest.

  3. For paired end reads, you can just merge all in a big fasta/fastq file and feed all to the script. Coverage is corrected and normalized based on median value, number of genes in the gene families and average length of genes. Moreover, you can display the coverage curve (arguement --o_covplot_normed in and then adjust the thresholds for presence/absence detection. They are all the --min_coverage, --left_max, --right_min, --th_non_present, --th_present and --th_multicopy parameters. Usually we stick to the default value, but it can be relevant to play a bit with them to see if there’s a way to optimize the analysis.

I hope this answer all your question. Feel free to ask if you need more details.

Léonard Dubois

Perfect, thanks!