Inquiry regarding the identification and removal of false-positive plasmids in non-iHMP datasets

Inquiry on reproducing the method for identifying false-positive plasmids (re: remove_plasmid.py)

Hello BAQLaVa team,

My name is Min-uk Park.

Thank you for developing this great pipeline. While reviewing the scripts, I noticed remove_plasmid.py and the associated lists (MGX_plasmid_VGBs.txt and MTX_plasmid_VGBs.txt), which are used to filter out false-positive VGBs (plasmids) found in the iHMP dataset.

We are highly interested in applying a similar filtering approach to our own dataset to identify and remove any dataset-specific plasmids or host DNA that might be misclassified as viruses.

Could you please share how we can reproduce your approach? Specifically, we would like to ask:

  1. Reproduction of iHMP filtering: What exact pipeline, tools, or criteria did you use to generate the MGX/MTX_plasmid_VGBs.txt lists from the iHMP dataset?

  2. Recommended workflow: If we want to check for plasmid sequences in a new dataset, would you recommend extracting the FASTA sequences of the identified VGBs and running them through tools like geNomad or CheckV? Or is there a specific workflow you suggest for this purpose?
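
To make the second question concrete, here is a minimal sketch of the extraction step we have in mind. It is only our assumption of how this could be done, not a description of your method: the function names are hypothetical, and it assumes that each VGB ID is the first whitespace-delimited token of a FASTA header, which may not match how the BAQLaVa reference sequences are actually named.

```python
import io
from typing import Iterable, Iterator, TextIO, Tuple


def read_fasta(handle: TextIO) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from a FASTA stream."""
    header, chunks = None, []
    for line in handle:
        line = line.rstrip("\n")
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line.strip())
    if header is not None:
        yield header, "".join(chunks)


def extract_vgb_sequences(
    fasta_handle: TextIO, vgb_ids: Iterable[str], out_handle: TextIO
) -> int:
    """Write records whose ID is in vgb_ids; return how many were written.

    Assumption: the record ID is the first token of the header line.
    """
    wanted = set(vgb_ids)
    written = 0
    for header, seq in read_fasta(fasta_handle):
        tokens = header.split()
        seq_id = tokens[0] if tokens else ""
        if seq_id in wanted:
            out_handle.write(f">{header}\n{seq}\n")
            written += 1
    return written


if __name__ == "__main__":
    # Toy input with made-up IDs, standing in for a real reference FASTA.
    reference = io.StringIO(">VGB_1 example\nACGT\nAC\n>VGB_2\nGGGG\n")
    subset = io.StringIO()
    count = extract_vgb_sequences(reference, ["VGB_1"], subset)
    print(count, subset.getvalue(), sep="\n")
```

The resulting FASTA subset is what we would then pass to geNomad or CheckV, if that matches what you would recommend.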

Thank you for your time and guidance!

Best regards,

Min-uk Park

Seoul National University