Inquiry on reproducing the method for identifying false-positive plasmids (re: remove_plasmid.py)
Hello BAQLaVa team,
My name is Min-uk Park.
Thank you for developing this great pipeline. While reviewing the scripts, I noticed remove_plasmid.py and the associated lists (MGX_plasmid_VGBs.txt and MTX_plasmid_VGBs.txt) which are used to filter out false-positive VGBs (plasmids) found in the iHMP dataset.
We are highly interested in applying a similar filtering approach to our own dataset to identify and remove any dataset-specific plasmids or host DNA that might be misclassified as viruses.
Could you please share how we can reproduce your approach? Specifically, we would like to ask:
- Reproduction of iHMP filtering: What exact pipeline, tools, or criteria did you use to generate the MGX/MTX_plasmid_VGBs.txt lists from the iHMP dataset?
- Recommended Workflow: If we want to check for plasmid sequences in a new dataset, would you recommend extracting the fasta sequences of the identified VGBs and running them through tools like geNomad or CheckV? Or is there a specific workflow you suggest for this purpose?
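
To make the second question concrete, the extraction step we have in mind could look roughly like the sketch below. It is only an illustration under our own assumptions, not a description of how BAQLaVa works: we assume the reference sequences are available as a FASTA file, that the first whitespace-delimited token of each header is the VGB identifier, and that the identified VGBs are available as a set of those identifiers. The function names `read_fasta` and `extract_vgbs` are hypothetical.

```python
from typing import Iterable, Iterator, Set, Tuple


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header = None
    chunks = []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            # A new record starts: emit the previous one, if any.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None and line:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def extract_vgbs(
    fasta_lines: Iterable[str], vgb_ids: Set[str]
) -> Iterator[Tuple[str, str]]:
    """Keep only records whose ID is in vgb_ids.

    Assumption: the record ID is the first whitespace-delimited
    token of the header line.
    """
    for header, seq in read_fasta(fasta_lines):
        tokens = header.split()
        record_id = tokens[0] if tokens else ""
        if record_id in vgb_ids:
            yield header, seq


if __name__ == "__main__":
    # Toy input standing in for a real FASTA file and VGB list.
    example = [">VGB_1 example\n", "ACGT\n", "GG\n", ">VGB_2\n", "TTTT\n"]
    for header, seq in extract_vgbs(example, {"VGB_1"}):
        print(f">{header}\n{seq}")
```

The resulting subset would then be written to a file and passed to geNomad or CheckV, if that is the direction you would recommend.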
Thank you for your time and guidance!
Best regards,
Min-uk Park
Seoul National University