Hi Guys,
Firstly thank you for developing this great suit of tools. I am in the process of developing a nextflow metaphlan3 module and one of the requirements for module submission to the nf-core community is having a small toy database and dataset available. As the metaphlan DB is quite large, I had hoped to simply use bowtie2 to index afasta file with a couple of genes from SARS-CoV-2 genome (the nf-core community use this for testing), create a pickle file with these genes and provide the path to this DB for testing.
I have attempted to create these following the instructions provided at: [BUG] database installation error · Issue #103 · biobakery/MetaPhlAn · GitHub and have so far:
i) created a fasta file with a nucleotide sequences from SARS-CoV2 genes
ii) indexed these using bowtie2-build
iii) modified the ‘customising the database’ script in the docs to add the SARS-CoV2 marker gene data (see below)
iv) provided path to new methaphlan db in run with 'add_viruses` enabled
I can see that the database is ‘working’ as sequences are being aligned to the markers (see bowtie2 out) but it still fails to classify the sequences as SARS-CoV-2… I feel like I may have oversimplified a couple of steps in constructing the DB and would appreciate some guidance.
prepare_metaphlan_db script:
import pickle
import bz2
db = pickle.load(bz2.open('metaphlan_databases/mpa_v30_CHOCOPhlAn_201901.pkl', 'r'))
print(list(db['markers'].items())[-1])
for k in db:
db[k].clear() #remove all entires
# Add the taxonomy of the new genomes
db['taxonomy']['k__Viruses|p__Pisuviricota|c__Pisoniviricetes|o__Nidovirales|f__Coronaviridae|g__Betacoronavirus|s__Severe_acute_respiratory_syndrome_coronavirus_2|t__GCA_009858895.3'] = ('||||||694009', 29903)
# Add the information of the new marker as the other markers
db['markers']['694009_GeneID:43740576'] = { #name of header in marker gene fasta arbitary, just need to match maker in these pickle files!
'clade': 's__Severe_acute_respiratory_syndrome_coronavirus_2',
'ext': [],
'len': 117,
'taxon': 'k__Viruses|p__Pisuviricota|c__Pisdoniviricetes|o__Nidovirales|f__Coronaviridae|g__Betacoronavirus|s__Severe_acute_respiratory_syndrome_coronavirus_2'
}
# To see an example, try to print the first marker information:
print(list(db['markers'].items())[0])
print(list(db['taxonomy'].items())[0])
# Save the new mpa_pkl file
with bz2.BZ2File('metaphlan3_new_db/mpa_v30_CHOCOPhlAn_201901.pkl', 'w') as ofile:
pickle.dump(db, ofile, pickle.HIGHEST_PROTOCOL)
```````````````````````
output test.txt
`````
SampleID Metaphlan_Analysis
#clade_name NCBI_tax_id relative_abundance additional_species
UNKNOWN -1 100.0
`````