Inquiry on reproducing the method for identifying false-positive plasmids (re: remove_plasmid.py)
Hello BAQLaVa team,
My name is Min-uk Park.
Thank you for developing this great pipeline. While reviewing the scripts, I noticed remove_plasmid.py and the associated lists (MGX_plasmid_VGBs.txt and MTX_plasmid_VGBs.txt) which are used to filter out false-positive VGBs (plasmids) found in the iHMP dataset.
We are highly interested in applying a similar filtering approach to our own dataset to identify and remove any dataset-specific plasmids or host DNA that might be misclassified as viruses.
Could you please share how we can reproduce your approach? Specifically, we would like to ask:
- Reproduction of iHMP filtering: What exact pipeline, tools, or criteria did you use to generate the MGX/MTX_plasmid_VGBs.txt lists from the iHMP dataset?
- Recommended workflow: If we want to check for plasmid sequences in a new dataset, would you recommend extracting the fasta sequences of the identified VGBs and running them through tools like geNomad or CheckV? Or is there a specific workflow you suggest for this purpose?
Thank you for your time and guidance!
Best regards,
Min-uk Park
Seoul National University
This approach was specific to the HMP2 viral profile dataset. During analysis, we found that a small number of VGBs were not behaving as expected and potentially contained plasmid genetic material alongside viral genomes within the VGB (though this is also complicated by the existence of phage-plasmids, etc. It's a messy world!).
To address this, after BAQLaVa profiling, we used the marker and ORF tempfiles (which record the mapped abundance of each) together with PPR-Meta plasmid probability scores to identify VGBs whose detection in our dataset was likely driven by markers or ORFs originating from plasmid-like rather than viral-like material (e.g. the markers that recruited reads came from genomes classified by PPR-Meta as likely plasmid). We did not systematically remove these VGBs in the manuscript analysis, but we did ensure that none of the highlighted examples of interesting viruses or viral traits originated from these VGBs. However, other lab members running subsequent analyses with BAQLaVa on new datasets have since been removing this set of VGBs with the remove_plasmid.py script and its associated VGB lists.
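A minimal sketch of this kind of check is below. It is not the code used for the manuscript. The input structures, the function name `flag_plasmid_driven_vgbs`, and both cutoffs are hypothetical, and it assumes the tempfiles and PPR-Meta output have already been parsed into plain dictionaries. It flags a VGB when most of its mapped abundance comes from markers or ORFs whose source genome has a high plasmid probability.

```python
from collections import defaultdict


def flag_plasmid_driven_vgbs(feature_abundance, feature_to_genome, genome_to_vgb,
                             plasmid_prob, prob_cutoff=0.5, fraction_cutoff=0.5):
    """Return {vgb: plasmid_fraction} for VGBs dominated by plasmid-like signal.

    feature_abundance: {marker_or_orf_id: mapped abundance} (hypothetical layout)
    feature_to_genome: {marker_or_orf_id: source genome id}
    genome_to_vgb:     {genome id: VGB id}
    plasmid_prob:      {genome id: plasmid probability score}
    Both cutoffs are illustrative assumptions, not values from the pipeline.
    """
    total = defaultdict(float)
    plasmid_like = defaultdict(float)

    for feature, abundance in feature_abundance.items():
        genome = feature_to_genome[feature]
        vgb = genome_to_vgb[genome]
        total[vgb] += abundance
        # Genomes without a score are treated as not plasmid-like (assumption).
        if plasmid_prob.get(genome, 0.0) >= prob_cutoff:
            plasmid_like[vgb] += abundance

    flagged = {}
    for vgb, vgb_total in total.items():
        if vgb_total > 0:
            fraction = plasmid_like[vgb] / vgb_total
            if fraction >= fraction_cutoff:
                flagged[vgb] = fraction
    return flagged


# Toy example: VGB_1 is detected mostly through a plasmid-like genome (gB).
feature_abundance = {"m1": 10.0, "m2": 30.0, "m3": 5.0}
feature_to_genome = {"m1": "gA", "m2": "gB", "m3": "gC"}
genome_to_vgb = {"gA": "VGB_1", "gB": "VGB_1", "gC": "VGB_2"}
plasmid_prob = {"gA": 0.1, "gB": 0.9, "gC": 0.2}

flagged = flag_plasmid_driven_vgbs(
    feature_abundance, feature_to_genome, genome_to_vgb, plasmid_prob
)
print(flagged)
```

Weighting by mapped abundance rather than by genome count means a VGB is flagged only when the plasmid-like genomes are the ones actually recruiting reads in the dataset at hand.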
This approach is imperfect: it removes the set of VGBs identified as problematic within our specific dataset, but it may not identify all problematic VGBs in new environmental contexts. We are therefore working on an update to address this. The update will remove bad genomes from the VGB set (markers & ORFs) while preserving the VGBs themselves. This prevents the problem that arises by self-filtering as you suggested, whereby a VGB with 5 ‘good’ viral genomes and 1 ‘bad’ plasmid genome is ‘poisoned’ entirely by the presence of the 1 bad genome, whether or not that genome itself actually recruits reads or is the reason the VGB is identified in BAQLaVa’s workflow.
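To illustrate the difference between dropping whole VGBs and pruning genomes within them, here is a rough sketch. It does not reflect the actual update, which is still in development. The function name `prune_plasmid_genomes`, the dictionary layout, and the cutoff are all hypothetical.

```python
def prune_plasmid_genomes(vgb_to_genomes, plasmid_prob, prob_cutoff=0.5):
    """Drop plasmid-like genomes from each VGB but keep the VGB itself.

    vgb_to_genomes: {VGB id: [genome ids]} (hypothetical layout)
    plasmid_prob:   {genome id: plasmid probability score}
    A VGB disappears only if none of its genomes survive the cutoff.
    """
    pruned = {}
    for vgb, genomes in vgb_to_genomes.items():
        # Genomes without a score are kept (assumption).
        kept = [g for g in genomes if plasmid_prob.get(g, 0.0) < prob_cutoff]
        if kept:
            pruned[vgb] = kept
    return pruned


# Toy example: VGB_1 has 5 'good' genomes and 1 'bad' one; VGB_2 is plasmid-only.
vgb_to_genomes = {
    "VGB_1": ["g1", "g2", "g3", "g4", "g5", "g_bad"],
    "VGB_2": ["p1"],
}
plasmid_prob = {"g_bad": 0.95, "p1": 0.9}

print(prune_plasmid_genomes(vgb_to_genomes, plasmid_prob))
```

In this toy case VGB_1 survives with its five viral genomes, whereas VGB-level filtering would have discarded it because of the single plasmid-like member.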
For now, you may use the remove_plasmid.py script and its associated set of VGBs, or wait for the plasmid update, which will be released shortly and will be a more long-term solution. Please note that the v1.1.0-alpha branch is in development and not yet ready for use.