MetaPhlAn database has duplicate entries, causing downstream failures

The MetaPhlAn database mpa_vOct22_CHOCOPhlAnSGB_202403 contains duplicate sequences, which result in duplicate @SQ headers in the SAM output. Downstream, sample2markers.py fails to parse the header of a .sam file produced by metaphlan4 (see the htslib/pysam error in the log below). This duplication appears similar to the one reported here: VSG sequence duplications found in MetaPhlAn4 vOct22 pre-built bowtie2 index file

An example duplicated sequence:

bunzip2 -c mpa_vOct22_CHOCOPhlAnSGB_202403.fna.bz2 | grep 'VDB|003B-0000-0-01C2|M1-c0-c0-c0'
>VDB|003B-0000-0-01C2|M1-c0-c0-c0 240413_kVSG_NC_0274021_Salmonella_phage_SPN3US
>VDB|003B-0000-0-01C2|M1-c0-c0-c0 240413_kVSG_NC_0274021_Salmonella_phage_SPN3US
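To list every duplicated ID rather than grepping for one known example, something like the following could be used. This is only a sketch: the function names are made up for illustration, and it assumes the sequence ID is the first whitespace-delimited token of each FASTA header line, as in the output above.

```python
import bz2
from collections import Counter


def duplicated_fasta_ids(lines):
    """Return {sequence_id: count} for IDs that occur more than once.

    Assumes the ID is the first whitespace-delimited token after '>'.
    """
    counts = Counter(
        line[1:].split()[0]
        for line in lines
        if line.startswith(">") and len(line.strip()) > 1
    )
    return {seq_id: n for seq_id, n in counts.items() if n > 1}


def duplicated_ids_in_bz2(path):
    """Scan a bzip2-compressed FASTA file (hypothetical helper)."""
    with bz2.open(path, "rt") as handle:
        return duplicated_fasta_ids(handle)


# Example call, using the file name from the command above:
# dups = duplicated_ids_in_bz2("mpa_vOct22_CHOCOPhlAnSGB_202403.fna.bz2")
# for seq_id, n in sorted(dups.items()):
#     print(n, seq_id, sep="\t")
```

Only header lines are inspected, so this does not check whether the duplicated entries also carry identical sequence data.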

An example error:

Thu Jun 26 16:35:33 2025: Start samples to markers execution
Thu Jun 26 16:35:33 2025: Using samtools version 1.21
Thu Jun 26 16:35:33 2025: Creating temporary directory...
Thu Jun 26 16:35:33 2025: Done.
Thu Jun 26 16:35:33 2025: Filtering SAM files...
Thu Jun 26 16:35:33 2025: Loading MetaPhlAn mpa_vOct22_CHOCOPhlAnSGB_202403 database...
Thu Jun 26 16:36:00 2025: Done.
Thu Jun 26 16:36:03 2025: Setting default parameters for mapper bowtie2
[E::sam_hrecs_refs_from_targets_array] Duplicate entry "VDB|003B-0000-0-01C2|M1-c0-c0-c0" in target list
[E::sam_parse1] failed to parse header
Traceback (most recent call last):
  File "/usr/local/bin/sample2markers.py", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.9/site-packages/metaphlan/utils/sample2markers.py", line 587, in main
    sampletomarkers.run_sample2markers()
  File "/usr/local/lib/python3.9/site-packages/metaphlan/utils/sample2markers.py", line 438, in run_sample2markers
    self.filter_sam_files()
  File "/usr/local/lib/python3.9/site-packages/metaphlan/utils/sample2markers.py", line 423, in filter_sam_files
    self.input = execute_pool(((SampleToMarkers.parallel_filter_sam, i, self.tmp_dir, self.input_format,
  File "/usr/local/lib/python3.9/site-packages/metaphlan/utils/parallelisation.py", line 102, in execute_pool
    return list(gen)
  File "/usr/local/lib/python3.9/site-packages/metaphlan/utils/parallelisation.py", line 95, in <genexpr>
    gen = (function(*a) for function, *a in args)
  File "/usr/local/lib/python3.9/site-packages/metaphlan/utils/sample2markers.py", line 386, in parallel_filter_sam
    aln = pysam.AlignedSegment.fromstring(line, header)
  File "pysam/libcalignedsegment.pyx", line 1124, in pysam.libcalignedsegment.AlignedSegment.fromstring
ValueError: parsing SAM record string failed (error code -2)

Hi @seandavi
Thanks for spotting that. The issue was with the viral sequences only, so the rest of the mappings shouldn’t be affected. I fixed the files; you can either rerun the analysis with the updated database or, if your aim is to run sample2markers, remove the VDB hits.
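For anyone taking the second route, here is one possible way to drop the VDB entries from an existing SAM file before running sample2markers. It is a rough sketch, not an official MetaPhlAn utility: the function name `strip_vdb` is hypothetical, and it assumes "VDB hits" means @SQ header lines and alignment records whose reference name starts with `VDB|`.

```python
def strip_vdb(sam_lines, prefix="VDB|"):
    """Yield SAM lines, skipping @SQ headers and alignments on `prefix` targets.

    Assumption: viral markers are exactly the references whose name starts
    with `prefix`. Mate references (RNEXT, column 7) are not checked.
    """
    for line in sam_lines:
        fields = line.rstrip("\n").split("\t")
        if fields[0] == "@SQ":
            # Drop header entries for viral references, including duplicates.
            if any(f.startswith("SN:" + prefix) for f in fields[1:]):
                continue
        elif not line.startswith("@"):
            # Alignment record: column 3 (index 2) is the reference name.
            if len(fields) > 2 and fields[2].startswith(prefix):
                continue
        yield line


# Example use on an uncompressed SAM file (file names are placeholders):
# with open("sample.sam") as src, open("sample.noVDB.sam", "w") as dst:
#     dst.writelines(strip_vdb(src))
```

Rerunning with the updated database is still the cleaner option, since it also keeps the viral results.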
Thanks!