MetaPhlAn database problem

Hello

According to the metaphlan.py file, a MetaPhlAn run requires two important database files: a .pkl file and a Bowtie2 index.

I downloaded the file "mpa_vOct22_CHOCOPhlAnSGB_202403.tar" (http://cmprod1.cibio.unitn.it/biobakery4/metaphlan_databases/mpa_vOct22_CHOCOPhlAnSGB_202403.tar) and decompressed it. It contains a .pkl file, _SGB.fna.bz2, _VINFO.csv, and _VSG.fna.bz2.

The _SGB.fna.bz2 file contains 7,339,971 sequences.
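
For reference, a count like this can be obtained by counting the header lines in the compressed FASTA. The following is a minimal sketch, assuming every record starts with exactly one header line beginning with ">"; the function name count_fasta_records is only illustrative and is not part of MetaPhlAn.

```python
import bz2


def count_fasta_records(path):
    """Count FASTA records in a bz2-compressed file.

    Assumption: each record has exactly one header line starting with '>'.
    """
    # Open in text mode so each line is a str rather than bytes
    with bz2.open(path, "rt") as handle:
        return sum(1 for line in handle if line.startswith(">"))
```

Calling count_fasta_records() on the _SGB.fna.bz2 file from the tar gives the number to compare against the marker count in the .pkl file.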

However, the information in the .pkl file is inconsistent with the _SGB.fna.bz2 file.

import bz2
import pickle

# Load the bz2-compressed pickle
with bz2.open('mpa_vOct22_CHOCOPhlAnSGB_202403.pkl', 'rb') as handle:
    db = pickle.load(handle)
print(db.keys())
# dict_keys(['taxonomy', 'markers', 'merged_taxon'])

# Number of taxonomy entries
print(len(db['taxonomy']))  # 30216 species

# Number of marker genes
print(len(db['markers']))  # 5751328 marker genes
# whereas the _SGB.fna.bz2 file contains 7,339,971 sequences

So, could you please check this tar file?

Hi @ZhaoHuiyao

I downloaded the tar file to check and the mpa_vOct22_CHOCOPhlAnSGB_202403.fna.bz2 file has 5,843,065 sequences, which correspond to the 5,751,328 present in the pkl file plus viral sequences which are not in the pickle (and shouldn’t be). Could you please repeat your download and let me know if the problem persists?
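
If the counts still differ after a fresh download, it may help to list which sequence IDs are in the FASTA but not in the pickle, and vice versa. Below is a rough sketch, assuming the first word of each FASTA header matches a key of db['markers']; that matching rule is an assumption, and the helper names fasta_ids and compare_with_pickle are hypothetical, not part of MetaPhlAn.

```python
import bz2
import pickle


def fasta_ids(path):
    """Return the set of record IDs (first word after '>') in a bz2 FASTA."""
    ids = set()
    with bz2.open(path, "rt") as handle:
        for line in handle:
            if line.startswith(">"):
                parts = line[1:].split()
                if parts:  # skip headers that carry no ID
                    ids.add(parts[0])
    return ids


def compare_with_pickle(fasta_path, pkl_path):
    """Return (IDs only in the FASTA, IDs only in the pickle).

    Assumption: FASTA IDs and db['markers'] keys use the same naming.
    """
    with bz2.open(pkl_path, "rb") as handle:
        db = pickle.load(handle)
    markers = set(db["markers"])
    in_fasta = fasta_ids(fasta_path)
    return in_fasta - markers, markers - in_fasta
```

The first returned set should then contain only the viral sequences, and the second should be empty if every marker in the pickle has a sequence in the FASTA.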