Strainphlan sample2markers.py ERROR

Dear developers,

I ran into a problem when running StrainPhlAn (under MetaPhlAn version 4.2.4) with the command "sample2markers.py -i vdb3.sam -f sam -o consensus_markers2 -n 1 -d /mpa_vOct22_CHOCOPhlAnSGB_202403.pkl". It shows:

[E::sam_hrecs_refs_from_targets_array] Duplicate entry "VDB|003B_0000_0_01C2|M1_c0_c0_c0" in target list
[E::sam_parse1] failed to parse header

I wonder whether sample2markers.py cannot handle the entries in the SAM file that start with "VDB". I reviewed the script and found that lines 58-59 read "if marker.startswith("VDB"): return False", and in another test I ran, after deleting all the entries starting with "VDB" from the SAM file, the script ran normally.

Hello @LEEzhu0110,

I believe this is a problem with an older MetaPhlAn database. Some viral "VDB" markers were duplicated, producing duplicate entries in the SAM header, which then caused sample2markers to fail. I think this was a problem with "mpa_vOct22_CHOCOPhlAnSGB_202212", which was then fixed in "mpa_vOct22_CHOCOPhlAnSGB_202403". I see you're using the newer one in sample2markers, but maybe you ran MetaPhlAn with the older 2022 one? You can check by looking at the first lines of the "*_profile.tsv" file.
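If you prefer to check this programmatically, here is a minimal sketch. It assumes the database name appears in one of the leading "#" comment lines of the profile and follows the pattern "mpa_v<release>_CHOCOPhlAnSGB_<date>"; the function name find_db_version is only illustrative and is not part of MetaPhlAn.

```python
import re

# Assumption: the database name looks like mpa_vOct22_CHOCOPhlAnSGB_202212
# and is written in a leading '#' comment line of the profile.
DB_PATTERN = re.compile(r"mpa_v\w+?_CHOCOPhlAnSGB_\d+")


def find_db_version(header_lines):
    """Return the first database name found in the leading comment lines, else None."""
    for line in header_lines:
        if not line.startswith("#"):
            break  # the comment header is over, stop looking
        match = DB_PATTERN.search(line)
        if match:
            return match.group(0)
    return None
```

You can pass an open file handle directly, since iterating over it yields lines. If the result ends in 202212, the sample was most likely profiled with the older database.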

The most correct solution would be to re-profile your samples with a newer MetaPhlAn database; I would suggest the newest one, Jan25. If you want to stick with Oct22, you can use the fixed 2024 version.

The simplest but "hacky" solution is to filter the SAM file to remove the VDB entries. As you pointed out in the sample2markers code, they are not used anyway. Something like the following:

bzcat /your/sample.sam.bz2 | grep -v "VDB|" | bzip2 -zc > /your/sample__no_VDB.sam.bz2

Then use the filtered SAM file for sample2markers.
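If you would rather confirm the diagnosis before filtering, the same idea can be sketched in Python. This is only an illustration, not code from sample2markers: the helper names duplicated_references and drop_vdb_lines are made up, and the filter simply mirrors the grep -v approach by dropping every line that contains "VDB|".

```python
from collections import Counter


def duplicated_references(sam_lines):
    """Return reference names that occur more than once in @SQ header lines."""
    counts = Counter()
    for line in sam_lines:
        if not line.startswith("@"):
            break  # header lines come first, so the header is over
        if line.startswith("@SQ"):
            for field in line.rstrip("\n").split("\t")[1:]:
                if field.startswith("SN:"):
                    counts[field[3:]] += 1
    return sorted(name for name, n in counts.items() if n > 1)


def drop_vdb_lines(sam_lines, tag="VDB|"):
    """Yield only the lines that do not contain the tag (same effect as grep -v)."""
    for line in sam_lines:
        if tag not in line:
            yield line
```

For a compressed file, bz2.open(path, "rt") from the standard library gives a text handle that can be passed to either function. If duplicated_references returns only VDB names, the filtered file should parse without the duplicate entry error.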

By the way, your SAM file does not look like it came from MetaPhlAn/bowtie2. Or maybe it was processed somehow?

Let me know if it helps

Michal
