Dear developers,
I ran into a problem when running StrainPhlAn (under MetaPhlAn version 4.2.4) with the command:
sample2markers.py -i vdb3.sam -f sam -o consensus_markers2 -n 1 -d /mpa_vOct22_CHOCOPhlAnSGB_202403.pkl
It fails with:
[E::sam_hrecs_refs_from_targets_array] Duplicate entry "VDB|003B_0000_0_01C2|M1_c0_c0_c0" in target list
[E::sam_parse1] failed to parse header
I wonder whether sample2markers.py cannot handle the entries in the SAM file that start with "VDB". I reviewed the script and found that lines 58-59 read: if marker.startswith("VDB"): return False. In another test, after deleting all the entries starting with "VDB" from the SAM file, the script ran normally.
Thanks for your help! Looking forward to your reply!
Hello @LEEzhu0110,
I believe this is a problem with an older MetaPhlAn database. Some viral “VDB” markers were duplicated, which produced duplicate entries in the SAM header, and these then made sample2markers fail. I think this was a problem with “mpa_vOct22_CHOCOPhlAnSGB_202212”, which was then fixed in “mpa_vOct22_CHOCOPhlAnSGB_202403”. I see you’re using the newer one in sample2markers, but maybe you ran MetaPhlAn with the older 2022 one? You can check by looking at the first lines of the “*_profile.tsv” file.
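In case it is useful, here is a minimal Python sketch of that check. It assumes the profile starts with "#" comment lines and that the database name appears there in the form mpa_v...; the function name profile_db_versions is made up for illustration and is not part of MetaPhlAn.

```python
import re
import sys


def profile_db_versions(profile_path, max_lines=10):
    """Return database names (e.g. mpa_vOct22_CHOCOPhlAnSGB_202212)
    found in the leading '#' comment lines of a profile file."""
    found = []
    with open(profile_path, encoding="utf-8") as handle:
        for i, line in enumerate(handle):
            # Assumption: the database name is recorded in the first few
            # comment lines; stop at the first non-comment line.
            if i >= max_lines or not line.startswith("#"):
                break
            for name in re.findall(r"mpa_v\w+", line):
                if name not in found:
                    found.append(name)
    return found


if __name__ == "__main__":
    # Usage: python check_db.py /your/sample_profile.tsv
    for path in sys.argv[1:]:
        print(path, profile_db_versions(path))
```

If the name printed here ends in 202212 while sample2markers is given the 202403 .pkl, the two steps were run against different database versions.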
The most correct solution would be to re-profile your samples with a newer MetaPhlAn database; I would suggest the newest one, Jan25. If you want to stick to Oct22, you can use the fixed 2024 version.
The simplest but “hacky” solution is to filter the SAM file to remove the VDB entries. As you pointed out in the sample2markers code, they are not used anyway. Something like the following:
bzcat /your/sample.sam.bz2 | grep -v "VDB|" | bzip2 -zc > /your/sample__no_VDB.sam.bz2
and then use the filtered SAM file for sample2markers.
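If you would rather do the filtering in Python (for example because your SAM file is not bzip2-compressed), a rough equivalent of the one-liner above could look like this. It is only a sketch: the function name drop_vdb_lines is hypothetical, and like grep -v it drops every line containing "VDB|", header lines and alignments alike.

```python
import bz2


def _open(path, mode):
    # Use bz2 for *.bz2 files and the built-in open() for plain SAM files.
    if str(path).endswith(".bz2"):
        return bz2.open(path, mode, encoding="utf-8")
    return open(path, mode, encoding="utf-8")


def drop_vdb_lines(src_path, dst_path, tag="VDB|"):
    """Copy a SAM file, skipping every line that contains `tag`.

    Returns a (kept, dropped) tuple of line counts.
    """
    kept = dropped = 0
    with _open(src_path, "rt") as src, _open(dst_path, "wt") as dst:
        for line in src:
            if tag in line:
                dropped += 1
            else:
                dst.write(line)
                kept += 1
    return kept, dropped


# Example (paths are placeholders):
# drop_vdb_lines("/your/sample.sam.bz2", "/your/sample__no_VDB.sam.bz2")
```

The returned counts give a quick sanity check that only the VDB lines were removed.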
Btw, your SAM file does not look like it came from MetaPhlAn/bowtie2. Was it processed in some way?
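One way to look into both points (where the SAM came from, and whether the header really has duplicated references) is to inspect the header lines. Below is a small sketch, assuming a standard SAM header in which @PG lines name the programs that wrote the file and @SQ lines carry reference names in the SN tag; inspect_sam_header is a made-up helper name, not part of MetaPhlAn or samtools.

```python
from collections import Counter


def inspect_sam_header(lines):
    """Return (programs, duplicates) from an iterable of SAM lines.

    programs:   program names taken from @PG lines (PN tag, else ID tag)
    duplicates: sorted reference names that occur in more than one @SQ line
    """
    programs = []
    sq_names = Counter()
    for line in lines:
        if not line.startswith("@"):
            break  # header lines come first; alignments follow
        fields = line.rstrip("\n").split("\t")
        # Header fields look like TAG:VALUE; split only on the first colon.
        tags = dict(f.split(":", 1) for f in fields[1:] if ":" in f)
        if fields[0] == "@PG":
            programs.append(tags.get("PN", tags.get("ID", "?")))
        elif fields[0] == "@SQ" and "SN" in tags:
            sq_names[tags["SN"]] += 1
    duplicates = sorted(name for name, n in sq_names.items() if n > 1)
    return programs, duplicates


# Example (path is a placeholder):
# with open("vdb3.sam", encoding="utf-8") as handle:
#     programs, duplicates = inspect_sam_header(handle)
#     print(programs, duplicates)
```

A non-empty duplicates list corresponds to the "Duplicate entry ... in target list" error, and the programs list shows which tools have recorded themselves in the header.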
Let me know if it helps
Michal
Hello @Michal_Puncochar,
Thank you for your help! I did indeed remove the “VDB” data from the SAM file in the subsequent calculations. Afterwards I followed the instructions in the strainphlan4 · biobakery/biobakery Wiki, because that version seems more consistent with the MetaPhlAn version I used.
Once again, I would like to express my gratitude to you.
Best wishes!
Hello @Michal_Puncochar,
I’m currently facing a new problem, and I hope I can take up a little more of your time.
I was following the guide in the strainphlan4 · biobakery/biobakery Wiki. At “Step 4: Generate trees from alignments”, I used the following command:
strainphlan -s consensus_markers5/* -m clade_markers/t__SGB6173.fna -r reference_genomes/*.fna.bz2 -o output3 -c t__SGB6173 --phylophlan_mode fast --nproc 4
At first I did not add any filtering parameters. Although the info file produced with the tree indicates that 22 samples were retained in the end, the final tree contains fewer than 10 samples. Adjusting the filtering thresholds did not noticeably improve this.
Is this simply a consequence of my data, or do I need to change more parameters?
I’m very sorry to bother you again. Looking forward to your reply.
Wish you all the best!
Hello @Michal_Puncochar,
I’m very sorry that I didn’t reply sooner; my account was only unmuted today. Thank you very much for your explanations and clarifications.
Best wishes!