PanPhlAn duplicate sequences

Hi there, I am running panphlan v3.1 installed via conda, and when I profile samples with the Prevotella_copri pangenome, I get the following duplicated sequence warnings after panphlan_map.py (the step appears to finish correctly and produce output):

[W::sam_hdr_parse] Duplicated sequence 'GG703878.1'
[W::sam_hdr_parse] Duplicated sequence 'GG703877.1'
[W::sam_hdr_parse] Duplicated sequence 'GG703876.1'
[W::sam_hdr_parse] Duplicated sequence 'GG703875.1'
[W::sam_hdr_parse] Duplicated sequence 'GG703874.1'
[W::sam_hdr_parse] Duplicated sequence 'GG703873.1'
[W::sam_hdr_parse] Duplicated sequence 'GG703872.1'
[W::sam_hdr_parse] Duplicated sequence 'GG703871.1'
[W::sam_hdr_parse] Duplicated sequence 'GG703870.1'
[W::sam_hdr_parse] Duplicated sequence 'GG703869.1'
[W::sam_hdr_parse] Duplicated sequence 'GG703868.1'
[W::sam_hdr_parse] Duplicated sequence 'GG703867.1'
[W::sam_hdr_parse] Duplicated sequence 'GG703866.1'
[W::sam_hdr_parse] Duplicated sequence 'GG703865.1'
[W::sam_hdr_parse] Duplicated sequence 'GG703864.1'
[W::sam_hdr_parse] Duplicated sequence 'GG703863.1'
[W::sam_hdr_parse] Duplicated sequence 'GG703862.1'
[W::sam_hdr_parse] Duplicated sequence 'GG703861.1'
[W::sam_hdr_parse] Duplicated sequence 'GG703860.1'
[W::sam_hdr_parse] Duplicated sequence 'GG703859.1'
[W::sam_hdr_parse] Duplicated sequence 'GG703858.1'
[W::sam_hdr_parse] Duplicated sequence 'GG703857.1'
[W::sam_hdr_parse] Duplicated sequence 'GG703856.1'
[W::sam_hdr_parse] Duplicated sequence 'GG703855.1'
[W::sam_hdr_parse] Duplicated sequence 'GG703854.1'
[W::sam_hdr_parse] Duplicated sequence 'GG703853.1'
[W::sam_hdr_parse] Duplicated sequence 'GG703852.1'

Indeed, several sequences in Prevotella_copri_pangenome_contigs.fna appear to be duplicated: is this a concern for the reliability of the resulting output?

Thanks!
Fiona

Hello Fiona,

Indeed, this kind of warning appears for some species. Sometimes it is only a warning, and sometimes it makes the mapping fail (a somewhat random issue that has yet to be fixed).
It is not a problem for the analysis, since the issue is neutralized by the normalization during the profiling process. However, if it does cause trouble (with other species, for example, or if you simply want to stop the flood of warning messages), I’ve recently added a small script to the PanPhlAn repository on GitHub called panphlan_clean_pangenome.py that regenerates a “clean” pangenome. Just use:
panphlan_clean_pangenome.py --species Prevotella_copri --pangenome [path to the pangenome folder]
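
To confirm which contig IDs are actually duplicated (before or after cleaning), something like the following minimal Python sketch could help. It is not part of PanPhlAn: the function name duplicated_fasta_ids is made up for illustration, and it assumes the sequence ID is the first whitespace-delimited token on each ">" header line, which is the name that ends up in the SAM header.

```python
from collections import Counter


def duplicated_fasta_ids(path):
    """Return {sequence_id: count} for IDs occurring more than once in a FASTA file.

    Hypothetical helper for illustration only; not part of PanPhlAn.
    """
    counts = Counter()
    with open(path) as handle:
        for line in handle:
            if line.startswith(">"):
                # Assumption: the ID is the first whitespace-delimited
                # token after '>' (the rest is a free-text description).
                fields = line[1:].split()
                if fields:
                    counts[fields[0]] += 1
    # Keep only the IDs that appear at least twice
    return {seq_id: n for seq_id, n in counts.items() if n > 1}
```

Calling duplicated_fasta_ids("Prevotella_copri_pangenome_contigs.fna") should return an empty dict once the pangenome no longer contains repeated contig IDs.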

Hope this will be helpful.
Léonard

Thanks Léonard, I appreciate your reply! That’s helpful to know, thanks for linking to the cleaning script!

Best,
Fiona