The bioBakery help forum

Creating toy custom database for DSL2 nextflow module development

Hi Guys,

Firstly thank you for developing this great suit of tools. I am in the process of developing a nextflow metaphlan3 module and one of the requirements for module submission to the nf-core community is having a small toy database and dataset available. As the metaphlan DB is quite large, I had hoped to simply use bowtie2 to index afasta file with a couple of genes from SARS-CoV-2 genome (the nf-core community use this for testing), create a pickle file with these genes and provide the path to this DB for testing.

I have attempted to create these following the instructions provided at: [BUG] database installation error · Issue #103 · biobakery/MetaPhlAn · GitHub and have so far:
i) created a fasta file with a nucleotide sequences from SARS-CoV2 genes
ii) indexed these using bowtie2-build
iii) modified the ‘customising the database’ script in the docs to add the SARS-CoV2 marker gene data (see below)
iv) provided path to new methaphlan db in run with 'add_viruses` enabled

I can see that the database is ‘working’ as sequences are being aligned to the markers (see bowtie2 out) but it still fails to classify the sequences as SARS-CoV-2… I feel like I may have oversimplified a couple of steps in constructing the DB and would appreciate some guidance.

prepare_metaphlan_db script:

import pickle
import bz2

db = pickle.load(bz2.open('metaphlan_databases/mpa_v30_CHOCOPhlAn_201901.pkl', 'r'))

print(list(db['markers'].items())[-1])

for k in db:

    db[k].clear() #remove all entires

# Add the taxonomy of the new genomes

db['taxonomy']['k__Viruses|p__Pisuviricota|c__Pisoniviricetes|o__Nidovirales|f__Coronaviridae|g__Betacoronavirus|s__Severe_acute_respiratory_syndrome_coronavirus_2|t__GCA_009858895.3'] = ('||||||694009', 29903)

# Add the information of the new marker as the other markers

db['markers']['694009_GeneID:43740576'] = { #name of header in marker gene fasta arbitary, just need to match maker in these pickle files!

                                   'clade': 's__Severe_acute_respiratory_syndrome_coronavirus_2',

                                   'ext': [],

                                   'len': 117,

                                   'taxon':  'k__Viruses|p__Pisuviricota|c__Pisdoniviricetes|o__Nidovirales|f__Coronaviridae|g__Betacoronavirus|s__Severe_acute_respiratory_syndrome_coronavirus_2'

                                }

# To see an example, try to print the first marker information:

print(list(db['markers'].items())[0])

print(list(db['taxonomy'].items())[0])

# Save the new mpa_pkl file

with bz2.BZ2File('metaphlan3_new_db/mpa_v30_CHOCOPhlAn_201901.pkl', 'w') as ofile:

    pickle.dump(db, ofile, pickle.HIGHEST_PROTOCOL)
```````````````````````

output test.txt
`````
SampleID       Metaphlan_Analysis
#clade_name     NCBI_tax_id     relative_abundance      additional_species
UNKNOWN -1      100.0
`````

I have also wondered about the structure of the database. Neither the journal article describing it nor the GitHub website provide a useful specification of the database format. The file name of the database for download has 2019 in it and I figure that with the popularity of long read sequencing, the number of bacterial genomes are increasing at a high rate. It would be nice to have a 2021 database for download, or at least a reproducible workflow for how the 2019 downloadable one was created, neither which exist to my knowledge.

The steps are correct and there is no oversimplification in this, after identifying candidate markers, this is enough for creating the pickle.

About the reporting of 100% UNKNOWN, can you share the bt2out file? Having 100% UNKNOWN reported it may mean that there is no enough coverage in order to have a relative abundance report or very few markers have been mapped to the metagenome.

Hi Francesco,

Thanks for the response, see *bowtie2out below:

lcl|MT192765.1_cds_QIK50431.1_6_[gene=ORF6]_[protein=ORF6_protein]_[protein_id=QIK50431.1]_[location=27195..27380]_[gbkey=CDS]         694009_GeneID:43740572
lcl|MT192765.1_cds_QIK50436.1_11_[gene=ORF10]_[protein=ORF10_protein]_[protein_id=QIK50436.1]_[location=29551..29667]_[gbkey=CDS]    694009_GeneID:43740576
#nreads 11
#avg_read_length        732.1818181818181

Regarding the unknown output, this makes sense when using the fastq files as input as the reads have been significantly subsampled. However, I was wondering why SARS CoV2 was not detected when using the fasta file as input considering sequences were aligned to two markers? Is this due to the fact only 2 markers were mapped?

When developing these nextflow modules It’s strongly recommended to use data available on the github repo for testing, although this seems a good case for making a more representative metagenomic test dataset available. The key issue really I’m running into really is the size of the marker database, so I was hoping to create a toy SARS-CoV2 DB that could be hosted on the repo and used for testing.

I see here you are using as input a metagenome produced using long reads. MetaPhlAn is supposed to work with short read, however you can try using a local alignment instead the default end-to-end by running it with --bt2_ps very-sensitive-local.

We made available in the bioBakery tutorial a couple of subsampled HMP metagenomes for testing purposes, you can find it at metaphlan3 · biobakery/biobakery Wiki (github.com)

1 Like