The MetaPhlAn database mpa_vOct22_CHOCOPhlAnSGB_202403 contains duplicate sequences, which results in duplicate @SQ headers in the SAM output. Parsing then fails on those headers when a .sam file produced by MetaPhlAn 4 is used as input; in the log below, the error is raised by htslib via pysam in sample2markers.py. This duplication appears similar to the one reported in "VSG sequence duplications found in MetaPhlAn4 vOct22 pre-built bowtie2 index file".
An example duplicated sequence:
bunzip2 -c mpa_vOct22_CHOCOPhlAnSGB_202403.fna.bz2 | grep 'VDB|003B-0000-0-01C2|M1-c0-c0-c0'
>VDB|003B-0000-0-01C2|M1-c0-c0-c0 240413_kVSG_NC_0274021_Salmonella_phage_SPN3US
>VDB|003B-0000-0-01C2|M1-c0-c0-c0 240413_kVSG_NC_0274021_Salmonella_phage_SPN3US
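The grep check above confirms one known ID. To list every repeated ID in the FASTA, something like the following minimal sketch may help. It assumes that the sequence name is the header text up to the first whitespace (which matches the name shown in the error message below); the function name `find_duplicate_fasta_ids` is made up for illustration and is not part of MetaPhlAn.

```python
import bz2
import sys
from collections import Counter


def find_duplicate_fasta_ids(path):
    """Return {sequence_id: count} for IDs that occur more than once in a FASTA file."""
    # Read .bz2 files transparently; fall back to plain text otherwise.
    opener = bz2.open if str(path).endswith(".bz2") else open
    counts = Counter()
    with opener(path, "rt") as handle:
        for line in handle:
            if line.startswith(">"):
                # Assumption: the ID is the header text up to the first whitespace.
                fields = line[1:].split(None, 1)
                if fields:
                    counts[fields[0]] += 1
    return {seq_id: n for seq_id, n in counts.items() if n > 1}


if __name__ == "__main__":
    # Example: python find_dups.py mpa_vOct22_CHOCOPhlAnSGB_202403.fna.bz2
    for seq_id, n in sorted(find_duplicate_fasta_ids(sys.argv[1]).items()):
        print(f"{n}\t{seq_id}")
```

Each output line is a count followed by the repeated ID, so the result can be compared directly against the names in the "Duplicate entry" errors.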
An example error:
Thu Jun 26 16:35:33 2025: Start samples to markers execution
Thu Jun 26 16:35:33 2025: Using samtools version 1.21
Thu Jun 26 16:35:33 2025: Creating temporary directory...
Thu Jun 26 16:35:33 2025: Done.
Thu Jun 26 16:35:33 2025: Filtering SAM files...
Thu Jun 26 16:35:33 2025: Loading MetaPhlAn mpa_vOct22_CHOCOPhlAnSGB_202403 database...
Thu Jun 26 16:36:00 2025: Done.
Thu Jun 26 16:36:03 2025: Setting default parameters for mapper bowtie2
[E::sam_hrecs_refs_from_targets_array] Duplicate entry "VDB|003B-0000-0-01C2|M1-c0-c0-c0" in target list
[E::sam_parse1] failed to parse header
Traceback (most recent call last):
  File "/usr/local/bin/sample2markers.py", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.9/site-packages/metaphlan/utils/sample2markers.py", line 587, in main
    sampletomarkers.run_sample2markers()
  File "/usr/local/lib/python3.9/site-packages/metaphlan/utils/sample2markers.py", line 438, in run_sample2markers
    self.filter_sam_files()
  File "/usr/local/lib/python3.9/site-packages/metaphlan/utils/sample2markers.py", line 423, in filter_sam_files
    self.input = execute_pool(((SampleToMarkers.parallel_filter_sam, i, self.tmp_dir, self.input_format,
  File "/usr/local/lib/python3.9/site-packages/metaphlan/utils/parallelisation.py", line 102, in execute_pool
    return list(gen)
  File "/usr/local/lib/python3.9/site-packages/metaphlan/utils/parallelisation.py", line 95, in <genexpr>
    gen = (function(*a) for function, *a in args)
  File "/usr/local/lib/python3.9/site-packages/metaphlan/utils/sample2markers.py", line 386, in parallel_filter_sam
    aln = pysam.AlignedSegment.fromstring(line, header)
  File "pysam/libcalignedsegment.pyx", line 1124, in pysam.libcalignedsegment.AlignedSegment.fromstring
ValueError: parsing SAM record string failed (error code -2)
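As a possible stopgap until the database is fixed, the repeated @SQ lines could be dropped from the SAM file before passing it to sample2markers.py. The sketch below keeps the first @SQ line for each SN: name and skips later ones. It is an untested assumption that this is enough for sample2markers.py to proceed, and it presumes the duplicated entries really are identical sequences; `dedupe_sq_lines` and the file names are hypothetical.

```python
import sys


def dedupe_sq_lines(lines):
    """Yield SAM lines, skipping @SQ header lines whose SN: name was already seen."""
    seen = set()
    for line in lines:
        if line.startswith("@SQ\t"):
            name = None
            for field in line.rstrip("\n").split("\t")[1:]:
                if field.startswith("SN:"):
                    name = field[3:]
                    break
            if name is not None:
                if name in seen:
                    continue  # drop the repeated header entry
                seen.add(name)
        yield line


if __name__ == "__main__":
    # Usage (hypothetical file names): python dedupe_sq.py < in.sam > out.sam
    sys.stdout.writelines(dedupe_sq_lines(sys.stdin))
```

Alignment records are passed through unchanged, so reads mapped to a duplicated name still refer to the single remaining @SQ entry.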