Am I doing it right...?

AndrewM · July 7, 2020, 2:12pm

Hello,

I’m just getting started with PanPhlAn for analysis of dominant microbes in throat swab shotgun metagenomic data. I have run PanPhlAn against species which have the most reads according to kraken2 (e.g. Rothia mucilaginosa mean 3.2 million read pairs per sample). I used the original Illumina fastq files (first pair, 150 bp), without filtering out human reads, or reads from other organisms.

I get a heatmap looking like this:

I can see that all Rothia in our samples cluster with one set of reference genomes.

I just want to check: (1) Does this look like the sort of results one might expect? (2) Am I better to filter out human and even other organism reads before running PanPhlAn, and (3) How best to handle paired end reads?

NB - insert size median ~250 bp, so many reads overlap. Perhaps best to merge?

Thanks for your advice,

Andrew

leonard.dubois · July 8, 2020, 12:09pm

Hello Andrew,

Thanks for your interest in PanPhlAn.

(1) Your heatmap looks like the kind of result one would expect. The details vary with the species studied, the number of samples provided and the parameters of the profiling step, but in general this is what the output looks like.
(2) Filtering out human reads might speed up the mapping step. However, I do not advise filtering out reads from other organisms. PanPhlAn analyses a coverage curve of all gene families in a pangenome and classifies some as “multi-copy gene families”. These are gene families with a very large number of copies in a metagenomic sample, because they are housekeeping genes shared between multiple bacterial species. It is important to define this group so that the gene families that do not belong to it, and are specific to the species of interest, can be identified.
(3) For paired-end reads, you can just merge everything into one big fasta/fastq file and feed it to the script. Coverage is corrected and normalized based on the median value, the number of genes in the gene families and the average length of genes. Moreover, you can display the coverage curve (argument --o_covplot_normed in panphlan_profiling.py) and then adjust the thresholds for presence/absence detection: the --min_coverage, --left_max, --right_min, --th_non_present, --th_present and --th_multicopy parameters. Usually we stick to the default values, but it can be worth playing with them a bit to see whether the analysis can be optimized.
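
As a minimal sketch of the merging step: assuming "merge" here means plain concatenation of the forward and reverse files (not overlap-merging of read pairs), something like the following Python would do it. The function names and the file names in the comments are hypothetical and are not part of PanPhlAn.

```python
import gzip
from pathlib import Path


def open_maybe_gzip(path, mode):
    """Open a plain or gzip-compressed file in binary mode."""
    path = Path(path)
    if path.suffix == ".gz":
        return gzip.open(path, mode)
    return open(path, mode)


def concatenate_fastq(inputs, output, chunk_size=1 << 20):
    """Append every input FASTQ file, in order, to one output file.

    Reads are copied unchanged; pairs are NOT overlap-merged.
    A newline is added after any input that lacks a trailing one,
    so records from consecutive files never run together.
    """
    with open_maybe_gzip(output, "wb") as out_handle:
        for path in inputs:
            last_byte = b"\n"
            with open_maybe_gzip(path, "rb") as in_handle:
                for chunk in iter(lambda: in_handle.read(chunk_size), b""):
                    out_handle.write(chunk)
                    last_byte = chunk[-1:]
            if last_byte != b"\n":
                out_handle.write(b"\n")
    return Path(output)


# Hypothetical usage (file names are placeholders):
# concatenate_fastq(
#     ["sample01_R1.fastq.gz", "sample01_R2.fastq.gz"],
#     "sample01_all.fastq",
# )
```

The resulting single file is what would then be given to the profiling step for that sample.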

I hope this answers all your questions. Feel free to ask if you need more details.

Léonard Dubois

AndrewM · July 8, 2020, 12:32pm

Perfect, thanks!

Andrew
