Duplicated fasta entries in pangenomes

Hi @leonard.dubois ,

Some pangenomes have duplicated fasta entries.

bowtie2-inspect Cutibacterium_acnes | seqkit rmdup > /dev/null
[INFO] 58 duplicated records removed

This breaks the BAM-to-SAM conversion:

[E::sam_hrecs_update_hashes] Duplicate entry "GL383855.1" in sam header
samtools view: failed to add PG line to the header
[W::hts_set_opt] Cannot change block size for this format
samtools sort: failed to read header from "-"

Could you fix this?

Florian

Hi!

We are currently working on a new version of the database that will fix the issue along with expanding the pangenomes.

In the meantime, you can use the panphlan_clean_pangenome.py script from the GitHub repo. That should do the job.

Thanks Leonard.

I wasn’t aware of this script, so I did the cleanup with my own.
You probably already know this, but the metaphlan4 vOct22 database has the same issue.
Btw, could you upload a new panphlan release to bioconda? The current version is quite old and buggy.
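
For anyone hitting the same error, here is a minimal sketch of this kind of cleanup. It is not the panphlan_clean_pangenome.py script. It assumes that duplicates are identified by sequence ID (the first whitespace-delimited token of the header line) and that keeping only the first occurrence is acceptable. The function name dedup_fasta is hypothetical.

```python
from typing import Iterable, Iterator


def dedup_fasta(lines: Iterable[str]) -> Iterator[str]:
    """Yield FASTA lines, dropping records whose ID has already been seen.

    Assumption: the record ID is the first whitespace-delimited token
    after '>' and the first occurrence of each ID is the one to keep.
    """
    seen = set()
    keep = True  # whether the record currently being read is kept
    for line in lines:
        if line.startswith(">"):
            tokens = line[1:].split()
            seq_id = tokens[0] if tokens else ""
            keep = seq_id not in seen
            seen.add(seq_id)
        if keep:
            yield line


# Small in-memory demonstration with one duplicated ID.
example = [
    ">GL383855.1 first copy\n",
    "ACGTACGT\n",
    ">other_contig\n",
    "GGCC\n",
    ">GL383855.1 second copy\n",
    "TTTT\n",
]
cleaned = list(dedup_fasta(example))
print("".join(cleaned), end="")
```

To apply it to a real index, the same function could be fed the output of bowtie2-inspect through standard input, e.g. with sys.stdout.writelines(dedup_fasta(sys.stdin)), and the index rebuilt from the cleaned FASTA.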

Florian