Creating toy custom database for DSL2 nextflow module development

MGordon · May 28, 2021, 8:00am

Hi Guys,

Firstly thank you for developing this great suit of tools. I am in the process of developing a nextflow metaphlan3 module and one of the requirements for module submission to the nf-core community is having a small toy database and dataset available. As the metaphlan DB is quite large, I had hoped to simply use bowtie2 to index afasta file with a couple of genes from SARS-CoV-2 genome (the nf-core community use this for testing), create a pickle file with these genes and provide the path to this DB for testing.

I have attempted to create these following the instructions provided at: [BUG] database installation error · Issue #103 · biobakery/MetaPhlAn · GitHub and have so far:
i) created a fasta file with a nucleotide sequences from SARS-CoV2 genes
ii) indexed these using bowtie2-build
iii) modified the ‘customising the database’ script in the docs to add the SARS-CoV2 marker gene data (see below)
iv) provided path to new methaphlan db in run with 'add_viruses` enabled

I can see that the database is ‘working’ as sequences are being aligned to the markers (see bowtie2 out) but it still fails to classify the sequences as SARS-CoV-2… I feel like I may have oversimplified a couple of steps in constructing the DB and would appreciate some guidance.

prepare_metaphlan_db script:

import pickle
import bz2

db = pickle.load(bz2.open('metaphlan_databases/mpa_v30_CHOCOPhlAn_201901.pkl', 'r'))

print(list(db['markers'].items())[-1])

for k in db:

    db[k].clear() #remove all entires

# Add the taxonomy of the new genomes

db['taxonomy']['k__Viruses|p__Pisuviricota|c__Pisoniviricetes|o__Nidovirales|f__Coronaviridae|g__Betacoronavirus|s__Severe_acute_respiratory_syndrome_coronavirus_2|t__GCA_009858895.3'] = ('||||||694009', 29903)

# Add the information of the new marker as the other markers

db['markers']['694009_GeneID:43740576'] = { #name of header in marker gene fasta arbitary, just need to match maker in these pickle files!

                                   'clade': 's__Severe_acute_respiratory_syndrome_coronavirus_2',

                                   'ext': [],

                                   'len': 117,

                                   'taxon':  'k__Viruses|p__Pisuviricota|c__Pisdoniviricetes|o__Nidovirales|f__Coronaviridae|g__Betacoronavirus|s__Severe_acute_respiratory_syndrome_coronavirus_2'

                                }

# To see an example, try to print the first marker information:

print(list(db['markers'].items())[0])

print(list(db['taxonomy'].items())[0])

# Save the new mpa_pkl file

with bz2.BZ2File('metaphlan3_new_db/mpa_v30_CHOCOPhlAn_201901.pkl', 'w') as ofile:

    pickle.dump(db, ofile, pickle.HIGHEST_PROTOCOL)
```````````````````````

output test.txt
`````
SampleID       Metaphlan_Analysis
#clade_name     NCBI_tax_id     relative_abundance      additional_species
UNKNOWN -1      100.0
`````

Dario · May 30, 2021, 12:00am

I have also wondered about the structure of the database. Neither the journal article describing it nor the GitHub website provide a useful specification of the database format. The file name of the database for download has 2019 in it and I figure that with the popularity of long read sequencing, the number of bacterial genomes are increasing at a high rate. It would be nice to have a 2021 database for download, or at least a reproducible workflow for how the 2019 downloadable one was created, neither which exist to my knowledge.

fbeghini · June 1, 2021, 1:05pm

The steps are correct and there is no oversimplification in this, after identifying candidate markers, this is enough for creating the pickle.

About the reporting of 100% UNKNOWN, can you share the bt2out file? Having 100% UNKNOWN reported it may mean that there is no enough coverage in order to have a relative abundance report or very few markers have been mapped to the metagenome.

MGordon · June 1, 2021, 9:58pm

Hi Francesco,

Thanks for the response, see *bowtie2out below:

lcl|MT192765.1_cds_QIK50431.1_6_[gene=ORF6]_[protein=ORF6_protein]_[protein_id=QIK50431.1]_[location=27195..27380]_[gbkey=CDS]         694009_GeneID:43740572
lcl|MT192765.1_cds_QIK50436.1_11_[gene=ORF10]_[protein=ORF10_protein]_[protein_id=QIK50436.1]_[location=29551..29667]_[gbkey=CDS]    694009_GeneID:43740576
#nreads 11
#avg_read_length        732.1818181818181

Regarding the unknown output, this makes sense when using the fastq files as input as the reads have been significantly subsampled. However, I was wondering why SARS CoV2 was not detected when using the fasta file as input considering sequences were aligned to two markers? Is this due to the fact only 2 markers were mapped?

When developing these nextflow modules It’s strongly recommended to use data available on the github repo for testing, although this seems a good case for making a more representative metagenomic test dataset available. The key issue really I’m running into really is the size of the marker database, so I was hoping to create a toy SARS-CoV2 DB that could be hosted on the repo and used for testing.

fbeghini · June 3, 2021, 12:36pm

I see here you are using as input a metagenome produced using long reads. MetaPhlAn is supposed to work with short read, however you can try using a local alignment instead the default end-to-end by running it with --bt2_ps very-sensitive-local.

We made available in the bioBakery tutorial a couple of subsampled HMP metagenomes for testing purposes, you can find it at metaphlan3 · biobakery/biobakery Wiki (github.com)

Shi_Huang · October 18, 2021, 7:25am

Hello Francesco,
I’d like to know if we can de novo construct a custom metaphlan database starting from a set of microbial genomes.
Currently, I have a set of 170,000 microbial reference genomes including bacteria, archaea, and fungi. Is there any pipeline (chocoplan pipeline) for general users that can extract markers from each genome and construct a custom marker-gene database for taxonomic profiling?

Thank you so much!

Topic		Replies	Views
Metaphlan database problem MetaPhlAn	2	100	January 23, 2025
Getting HUMAnN2 and MetaPhlAn2 to run together HUMAnN	6	1793	May 10, 2020
Help with PanPhlAn tutorial PanPhlAn	6	1421	May 11, 2020
How do I create the .pkl required to run metaphlan? MetaPhlAn	2	472	February 8, 2023
Metaphlan marker level analysis MetaPhlAn	6	899	July 22, 2020

Creating toy custom database for DSL2 nextflow module development

Related topics