Using BAQLaVa Full-Length Genomes and Marker/Protein ID Formats

Inquiry: Using BAQLaVa Full-Length Genomes and Marker/Protein ID Formats

Dear BAQLaVa Team,

Hello, my name is Min-uk Park, a researcher at Seoul National University. I am currently conducting virome profiling research using BAQLaVa.

After completing the basic VGB profiling, I downloaded the full-length genomes from the provided link (BAQLaVa.V0.5.raw_databases.tar.gz) to proceed with further downstream analysis. I have two questions regarding this:

1. Differences between the two FASTA files and usage recommendations

Upon extracting the archive, I noticed two files: BAQLaVa_nucleotidedb.fasta and BAQLaVa_nucleotidedb_dereplicated.fasta.

  • What are the specific differences between these two files?

  • For general downstream analyses, such as read mapping or functional annotation, which of these files is officially recommended to be used as the reference?

2. Meaning of suffixes (_1, _2, _ORF) in Marker/Protein IDs and mapping methods

When reviewing the tempfile_markers.txt and tempfile_proteins.txt files generated from the profiling, I noticed suffixes like _1, _2, or _ORF attached to the marker and protein names (e.g., BAQ00000074_1, BAQ00000074_1_ORF).

  • What exactly do these numbers signify? (For instance, do they represent a specific segment position within the original contig, or the sequential order of the ORFs?)

  • What is the best practice for mapping and connecting these marker/protein IDs back to the provided full-genome FASTA files for downstream analysis (e.g., discovering specific functional genes, extracting marker sequences)?

Thank you for developing such an excellent tool. I look forward to hearing from you.

Best regards,

Min-uk, Park

Seoul National University

1. Differences between the two FASTA files and usage recommendations - what are the differences and which should I use?
BAQLaVa_nucleotidedb.fasta contains all genomes originally collected for the BAQLaVa database. BAQLaVa_nucleotidedb_dereplicated.fasta is the same database after dereplication with MMseqs2 at MIUViG thresholds (95% identity over 85% coverage of the shorter genome), which corresponds to clustering stage 1 in the manuscript (https://www.biorxiv.org/content/10.64898/2026.02.11.705346v1). For most analyses, BAQLaVa_nucleotidedb_dereplicated.fasta should be sufficient.
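If you want to see which genomes were dropped by dereplication, a minimal sketch like the one below may help. It only compares FASTA header IDs and assumes that the genomes kept in the dereplicated file carry the same header IDs as in the full file, which I have not verified here. The function names are hypothetical and not part of BAQLaVa.

```python
def fasta_ids(path):
    """Return the set of sequence IDs (first word of each '>' header) in a FASTA file."""
    ids = set()
    with open(path) as handle:
        for line in handle:
            if line.startswith(">"):
                parts = line[1:].split()
                if parts:  # skip malformed, empty headers
                    ids.add(parts[0])
    return ids


def removed_by_dereplication(full_path, derep_path):
    """IDs present in the full database but absent from the dereplicated one.

    Assumption: retained genomes keep identical header IDs in both files.
    """
    return fasta_ids(full_path) - fasta_ids(derep_path)


# Example call (paths are the file names from the extracted archive):
# dropped = removed_by_dereplication(
#     "BAQLaVa_nucleotidedb.fasta",
#     "BAQLaVa_nucleotidedb_dereplicated.fasta",
# )
# print(len(dropped), "genomes were removed as near-duplicates")
```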

2. Meaning of suffixes (_1, _2, _ORF) in Marker/Protein IDs and mapping methods - what do they mean and how should I use them?
These numbers are just internal identifiers used to differentiate and track the individual markers and ORFs. They are simply numbered sequentially along the genome. Please note that the nucleotide markers do not represent specific functional genes, just continuous VGB-unique genetic content. The markers and ORFs used by BAQLaVa and reported in the tempfiles are intentionally unique to each VGB, so they do not represent the full range of genes carried by a given VGB, which may limit how directly these tempfiles can be applied to functional analysis.
Nonetheless, the set of markers and proteins is available for download here: https://huttenhower.sph.harvard.edu/baqlava-db/BAQLaVa_markers_ORFs.tar.gz
Please note that we do not provide the genomic coordinates each marker or ORF originates from, so if you need positions you will have to map the sequences back to the genomes yourself.
We additionally provide a metadata file, used in the manuscript to characterize ORFs with both PFAMs (InterPro) and VFAMs (vogdb.org), which can be found in the biobakery/baqlava GitHub repository (master branch) at: baqlava/baqlava/utility_files/BAQLaVa_metadata_ORFs.txt.gz
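For mapping the nucleotide markers back to the genomes to recover positions, one simple option is exact string matching on both strands. The sketch below is only an illustration, not how BAQLaVa built its markers: it assumes each marker is an exact substring of some genome (or of its reverse complement), and the function names are hypothetical. It searches every marker against every genome you pass in, so it is best run on a subset of genomes of interest. If markers do not match exactly, or for protein ORFs, an alignment tool would be the more appropriate choice.

```python
def read_fasta(path):
    """Yield (id, sequence) pairs from a FASTA file."""
    name, chunks = None, []
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                if name is not None:
                    yield name, "".join(chunks)
                parts = line[1:].split()
                name = parts[0] if parts else ""
                chunks = []
            elif line and name is not None:
                chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Reverse complement of a nucleotide sequence (A, C, G, T, N only)."""
    return seq.translate(_COMPLEMENT)[::-1]


def locate(marker, genome):
    """Return (start, end, strand) as 1-based inclusive coordinates, or None.

    Assumption: the marker matches the genome exactly on one strand.
    Only the first occurrence is reported.
    """
    m, g = marker.upper(), genome.upper()
    if not m:
        return None
    pos = g.find(m)
    if pos != -1:
        return pos + 1, pos + len(m), "+"
    pos = g.find(reverse_complement(m))
    if pos != -1:
        return pos + 1, pos + len(m), "-"
    return None


def locate_markers(markers_fasta, genomes_fasta):
    """Brute-force search of every marker against every genome.

    Returns a list of (marker_id, genome_id, start, end, strand) tuples.
    """
    genomes = dict(read_fasta(genomes_fasta))
    hits = []
    for marker_id, marker_seq in read_fasta(markers_fasta):
        for genome_id, genome_seq in genomes.items():
            hit = locate(marker_seq, genome_seq)
            if hit is not None:
                hits.append((marker_id, genome_id) + hit)
    return hits
```

The resulting coordinates can then be joined to BAQLaVa_metadata_ORFs.txt.gz or to your own annotation by ID, once you have checked how the IDs in those files line up.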