Inquiry regarding the identification and removal of false-positive plasmids in non-iHMP datasets

alsdnr2295 · February 25, 2026, 10:10am

Inquiry on reproducing the method for identifying false-positive plasmids (re: remove_plasmid.py)

Hello BAQLaVa team,

My name is Min-uk Park.

Thank you for developing this great pipeline. While reviewing the scripts, I noticed remove_plasmid.py and the associated lists (MGX_plasmid_VGBs.txt and MTX_plasmid_VGBs.txt) which are used to filter out false-positive VGBs (plasmids) found in the iHMP dataset.

We are highly interested in applying a similar filtering approach to our own dataset to identify and remove any dataset-specific plasmids or host DNA that might be misclassified as viruses.

Could you please share how we can reproduce your approach? Specifically, we would like to ask:

1. Reproduction of iHMP filtering: What exact pipeline, tools, or criteria did you use to generate the MGX/MTX_plasmid_VGBs.txt lists from the iHMP dataset?
2. Recommended Workflow: If we want to check for plasmid sequences in a new dataset, would you recommend extracting the fasta sequences of the identified VGBs and running them through tools like geNomad or CheckV? Or is there a specific workflow you suggest for this purpose?

Thank you for your time and guidance!

Best regards,

Min-uk Park

Seoul National University

jjensen44 · March 6, 2026, 12:17am

This approach was specific to the HMP2 viral profile dataset. During analysis, we found that a small number of VGBs were not behaving as expected and potentially contained plasmid genetic material alongside viral genomes within the VGB (though this is also complicated by the existence of phage-plasmids, etc. It's a messy world!).

To address this, after BAQLaVa profiling, we used the marker and ORF tempfiles (which record the mapped abundance of each marker and ORF) together with PPR-meta plasmid probability scores to identify VGBs whose detection in our dataset was likely driven by markers or ORFs originating from plasmid-like rather than viral-like material (e.g. the markers that recruited reads came from genomes that PPR-meta classified as likely plasmid). We did not systematically remove these VGBs in the manuscript analysis, but we did ensure that none of the highlighted examples of interesting viruses or viral traits originated from these VGBs. However, other lab members running subsequent analyses with BAQLaVa on new datasets have since been removing this set of VGBs using the remove_plasmid.py script and the associated VGB lists.
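For readers who want to try something similar on their own data, here is a minimal sketch of the idea described above. It is not the code used for the HMP2 analysis: the input layout (tuples of VGB ID, source genome ID, mapped abundance), the function name, and both cutoffs are assumptions made for illustration, and the real tempfile formats and PPR-meta output would need to be parsed into this shape first.

```python
from collections import defaultdict


def flag_plasmid_driven_vgbs(hits, plasmid_prob, prob_cutoff=0.5, frac_cutoff=0.5):
    """Flag VGBs whose mapped signal comes mostly from plasmid-like genomes.

    hits          iterable of (vgb_id, genome_id, mapped_abundance); hypothetical layout
    plasmid_prob  dict genome_id -> plasmid probability (e.g. parsed from a classifier)
    prob_cutoff   assumed threshold for calling a genome plasmid-like
    frac_cutoff   assumed fraction of a VGB's signal that must be plasmid-like
    """
    total = defaultdict(float)
    plasmid_like = defaultdict(float)
    for vgb, genome, abundance in hits:
        total[vgb] += abundance
        # Genomes without a score are treated as non-plasmid (an assumption).
        if plasmid_prob.get(genome, 0.0) >= prob_cutoff:
            plasmid_like[vgb] += abundance
    return sorted(
        vgb for vgb, t in total.items()
        if t > 0 and plasmid_like[vgb] / t >= frac_cutoff
    )


# Toy data: VGB_2 is detected mainly through a likely-plasmid genome.
hits = [
    ("VGB_1", "genome_A", 10.0),
    ("VGB_1", "genome_B", 1.0),
    ("VGB_2", "genome_C", 2.0),
    ("VGB_2", "genome_D", 8.0),
]
plasmid_prob = {"genome_A": 0.1, "genome_B": 0.9, "genome_C": 0.2, "genome_D": 0.95}
flagged = flag_plasmid_driven_vgbs(hits, plasmid_prob)
print(flagged)
```

Weighting by mapped abundance rather than by genome count means a VGB is only flagged when the plasmid-like genomes are what actually recruit reads, which matches the reasoning in the paragraph above.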

This approach is imperfect: it removes the set of VGBs identified as problematic within our specific dataset, but it may not identify all problematic VGBs in new environmental contexts. As such, we are currently working on an update to address this. The update will remove bad genomes from the VGB set (markers & ORFs) while preserving the VGBs themselves. This avoids a problem that arises with the self-filtering you suggested, whereby a VGB with 5 ‘good’ viral genomes and 1 ‘bad’ plasmid genome is ‘poisoned’ entirely by the presence of the 1 bad genome, whether or not that genome actually recruits reads or is the reason the VGB is identified in BAQLaVa’s workflow.
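The difference between the two filtering strategies can be shown with a toy example. This is only a sketch under assumed data structures (a dict mapping each VGB ID to its member genome IDs, and a set of genomes judged plasmid-like); the function names are hypothetical and it does not reflect how the planned update is implemented.

```python
def filter_vgb_level(vgb_genomes, bad_genomes):
    """Drop every VGB that contains at least one bad genome."""
    return {
        vgb: list(genomes)
        for vgb, genomes in vgb_genomes.items()
        if not set(genomes) & bad_genomes
    }


def filter_genome_level(vgb_genomes, bad_genomes):
    """Drop only the bad genomes; keep a VGB if any genome remains."""
    kept = {}
    for vgb, genomes in vgb_genomes.items():
        remaining = [g for g in genomes if g not in bad_genomes]
        if remaining:
            kept[vgb] = remaining
    return kept


# Toy VGB with 5 'good' viral genomes and 1 'bad' plasmid genome.
vgb_genomes = {
    "VGB_X": ["v1", "v2", "v3", "v4", "v5", "p1"],
    "VGB_Y": ["v6"],
}
bad_genomes = {"p1"}

vgb_level = filter_vgb_level(vgb_genomes, bad_genomes)
genome_level = filter_genome_level(vgb_genomes, bad_genomes)
print(sorted(vgb_level))      # VGB_X is lost entirely
print(genome_level["VGB_X"])  # VGB_X survives without p1
```

With VGB-level filtering, VGB_X disappears even though five of its six genomes are viral; with genome-level filtering only p1 is removed.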

For now, you may use the remove_plasmid.py script and the associated VGB lists, or wait for the plasmid update, which will be released shortly and will be a more long-term solution. Please note that the v1.1.0-alpha branch is in development and not yet ready for use.
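If you maintain your own list of dataset-specific VGBs to exclude, the filtering step itself is simple. The sketch below is not remove_plasmid.py and makes no claim about its interface; it assumes a tab-separated profile with a header line and the VGB ID in the first column, and a plain list with one VGB ID per line. The function name and both layouts are hypothetical.

```python
def drop_listed_vgbs(profile_lines, vgb_list_lines, id_column=0, sep="\t"):
    """Return profile lines whose VGB ID is not in the exclusion list.

    Assumes the first profile line is a header and is always kept.
    """
    blocked = {line.strip() for line in vgb_list_lines if line.strip()}
    kept = []
    for i, line in enumerate(profile_lines):
        fields = line.rstrip("\n").split(sep)
        if i == 0 or fields[id_column] not in blocked:
            kept.append(line)
    return kept


# Toy profile and exclusion list; with real files, pass the open file handles.
profile = ["VGB\tsample1\n", "VGB_1\t0.5\n", "VGB_2\t0.3\n", "VGB_3\t0.2\n"]
vgb_list = ["VGB_2\n", "\n"]
filtered = drop_listed_vgbs(profile, vgb_list)
print("".join(filtered), end="")
```

Note that this drops rows without renormalizing the remaining abundances; whether renormalization is appropriate depends on how the profile values are used downstream.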
