Encountering an error in sample2markers.py -i SRS013951.sam.bz2 -o pkl_files/

I am trying to use StrainPhlAn by following the documentation. Step 1 should generate a sam.bz2 file and a bowtie2out.bz2 file, and it does. However, when I proceed to Step 2, it fails with an error about duplicate sequences. Interestingly, if I download the sam.bz2 file provided in the documentation, Step 2 runs smoothly and generates a .pkl file. I also downloaded a simple fastq file from the documentation and repeated Step 1 and Step 2, but Step 2 throws an error for that file as well. I would greatly appreciate any help.

[W::sam_hdr_create] Duplicated sequence “VDB|001E-003E-0-000B|M801-c99-c0-c179” in file “/home/pipeline/mithun/tmpdmr8n4tw/SRS013951.sam”
[W::sam_hdr_create] Duplicated sequence “VDB|000C-0046-0-0004|M801-c99-c0-c180” in file “/home/pipeline/mithun/tmpdmr8n4tw/SRS013951.sam”
[W::sam_hdr_create] Duplicated sequence “VDB|0016-000D-0-0000|M801-c99-c0-c181” in file “/home/pipeline/mithun/tmpdmr8n4tw/SRS013951.sam”
[W::sam_hdr_create] Duplicated sequence “VDB|001D-00F6-0-0002|M801-c99-c0-c182” in file “/home/pipeline/mithun/tmpdmr8n4tw/SRS013951.sam”
[W::sam_hdr_create] Duplicated sequence “VDB|0016-005F-0-0004|M801-c99-c0-c183” in file “/home/pipeline/mithun/tmpdmr8n4tw/SRS013951.sam”
[W::sam_hdr_create] Duplicated sequence “VDB|0040-0140-0-0007|M801-c99-c0-c184” in file “/home/pipeline/mithun/tmpdmr8n4tw/SRS013951.sam”
[W::sam_hdr_create] Duplicated sequence “VDB|001D-008E-0-0002|M801-c99-c0-c185” in file “/home/pipeline/mithun/tmpdmr8n4tw/SRS013951.sam”
[W::sam_hdr_create] Duplicated sequence “VDB|0040-0146-0-0004|M801-c99-c0-c186” in file “/home/pipeline/mithun/tmpdmr8n4tw/SRS013951.sam”
[W::sam_hdr_create] Duplicated sequence “VDB|0040-0052-0-0003|M801-c99-c0-c187” in file “/home/pipeline/mithun/tmpdmr8n4tw/SRS013951.sam”
[W::sam_hdr_create] Duplicated sequence “VDB|0040-0139-0-0001|M801-c99-c0-c188” in file “/home/pipeline/mithun/tmpdmr8n4tw/SRS013951.sam”
[W::sam_hdr_create] Duplicated sequence “VDB|0046-0165-0-0003|M801-c99-c0-c189” in file “/home/pipeline/mithun/tmpdmr8n4tw/SRS013951.sam”
[W::sam_hdr_create] Duplicated sequence “VDB|0047-002F-0-0006|M801-c99-c0-c190” in file “/home/pipeline/mithun/tmpdmr8n4tw/SRS013951.sam”
[W::sam_hdr_create] Duplicated sequence “VDB|001D-011E-0-0007|M801-c99-c0-c191” in file “/home/pipeline/mithun/tmpdmr8n4tw/SRS013951.sam”
[W::sam_hdr_create] Duplicated sequence “VDB|001D-00C7-0-0008|M801-c99-c0-c192” in file “/home/pipeline/mithun/tmpdmr8n4tw/SRS013951.sam”
[W::sam_hdr_create] Duplicated sequence “VDB|003B-0000-0-021D|M489-c9-c0-c0” in file “/home/pipeline/mithun/tmpdmr8n4tw/SRS013951.sam”
[E::sam_hrecs_update_hashes] Duplicate entry “VDB|003B-0000-0-01C2|M1-c0-c0-c0” in sam header
samtools view: failed to add PG line to the header

[e] An error was ocurred executing a external tool, exiting…
Wed Jun 14 18:44:34 2023: Stop StrainPhlAn execution.

Dear @Mithun_Verma
Can you tell me which version of MetaPhlAn 4 (database and tool) you are using?

Thanks for your reply @aitor.blancomiguez
I installed MetaPhlAn 4.0.2 and then installed the database using the following command:
metaphlan --install --bowtie2db

Hi Mithun, I just ran into this as well. In my situation, I concatenated a bunch of runs of filtered and deduplicated ONT reads into a single input file. I am guessing you did something similar here, but the combined file will contain duplicate reads.

So you need to do some pre-processing (i.e., deduplicate your input fastq file) before running metaphlan/strainphlan. There are many ways to accomplish this: text-based utilities like awk or paste would work, or you could use czid or seqkit rmdup or something similar. Hope this helps.
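
In case it is useful, here is a minimal Python sketch of that deduplication step. It assumes a plain, uncompressed FASTQ with exactly four lines per record. The names `read_fastq` and `dedup_records` are made up for this example and are not part of MetaPhlAn or StrainPhlAn. By default it keeps the first record seen for each read name (the header up to the first whitespace); passing `key="sequence"` drops reads with identical sequences instead.

```python
from typing import Iterable, Iterator, List, Tuple

# (header, sequence, separator, quality); hypothetical alias for this sketch
Record = Tuple[str, str, str, str]


def read_fastq(lines: Iterable[str]) -> Iterator[Record]:
    """Yield records from FASTQ text, assuming exactly four lines per record."""
    buf: List[str] = []
    for line in lines:
        line = line.rstrip("\r\n")
        if not line and not buf:
            continue  # tolerate blank lines between records
        buf.append(line)
        if len(buf) == 4:
            yield (buf[0], buf[1], buf[2], buf[3])
            buf = []
    if buf:
        raise ValueError("truncated FASTQ record at end of input")


def dedup_records(records: Iterable[Record], key: str = "name") -> Iterator[Record]:
    """Keep only the first record for each read name (or each sequence)."""
    if key not in ("name", "sequence"):
        raise ValueError("key must be 'name' or 'sequence'")
    seen = set()
    for rec in records:
        if key == "name":
            # Read name = header without the leading '@', up to the first whitespace.
            k = (rec[0][1:].split() or [""])[0]
        else:
            k = rec[1]
        if k not in seen:
            seen.add(k)
            yield rec


# Tiny made-up input: read1 appears twice, as it would after concatenating runs.
sample = [
    "@read1 run=A", "ACGT", "+", "IIII",
    "@read2 run=A", "GGCC", "+", "IIII",
    "@read1 run=B", "ACGT", "+", "IIII",
]
unique = list(dedup_records(read_fastq(sample)))
print(len(unique))  # 2
```

For a real file you would pass an open file handle to `read_fastq` and write each kept record back out as four lines. The set of seen keys is held in memory, so very large inputs may be better served by a dedicated tool such as seqkit rmdup.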

Hi @aitor.blancomiguez
There are 45872 duplicated entries in the vOct22 database.
You can get them with the following command:

bowtie2-inspect mpa_vOct22_CHOCOPhlAnSGB_202212 | grep '^>' | sort | uniq -d
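
For anyone who wants the same check without the shell pipeline, here is a rough Python equivalent of the grep / sort / uniq -d steps. It is only a sketch: `duplicated_headers` is a hypothetical helper, and it expects FASTA text (such as the bowtie2-inspect output for the index) passed in as an iterable of lines.

```python
from collections import Counter
from typing import Iterable, List


def duplicated_headers(lines: Iterable[str]) -> List[str]:
    """Return FASTA header lines that occur more than once, sorted."""
    counts = Counter(
        line.rstrip("\r\n") for line in lines if line.startswith(">")
    )
    return sorted(name for name, n in counts.items() if n > 1)


# Tiny made-up example; real input would be the bowtie2-inspect output.
fasta = [">markerA", "ACGT", ">markerB", "GGCC", ">markerA", "ACGT"]
print(duplicated_headers(fasta))  # ['>markerA']
```

Each duplicated header is reported once, which matches what `uniq -d` prints, so the length of the returned list should correspond to the count quoted above.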