Hello
According to the metaphlan.py file, a MetaPhlAn run requires two important database files: a .pkl file and a Bowtie2 index file.
I downloaded the file “mpa_vOct22_CHOCOPhlAnSGB_202403.tar” (http://cmprod1.cibio.unitn.it/biobakery4/metaphlan_databases/mpa_vOct22_CHOCOPhlAnSGB_202403.tar) and extracted it. It contains a .pkl file, _SGB.fna.bz2, _VINFO.csv, and _VSG.fna.bz2.
The _SGB.fna.bz2 file contains 7,339,971 sequences.
However, the information in the .pkl file is inconsistent with the _SGB.fna.bz2 file.
import bz2
import pickle

# Open the bz2-compressed pickle and load the database dictionary
with bz2.open('mpa_vOct22_CHOCOPhlAnSGB_202403.pkl', 'rb') as f:
    db = pickle.load(f)
print(db.keys())
# dict_keys(['taxonomy', 'markers', 'merged_taxon'])

# Number of entries under 'taxonomy'
count_taxa = len(db['taxonomy'])
print(count_taxa)  # 30216 species

# Number of entries under 'markers'
count_markers = len(db['markers'])
print(count_markers)  # 5751328 marker genes

# By contrast, the _SGB.fna.bz2 file contains 7,339,971 sequences.
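
One way to reproduce a sequence count for a bz2-compressed FASTA file is sketched below. It simply counts header lines, so it assumes one ">" line per sequence. The helper name `count_fasta_records` is only illustrative, and this is not necessarily how the figure above was obtained.

```python
import bz2


def count_fasta_records(path):
    """Count header lines (those starting with '>') in a bz2-compressed FASTA file."""
    with bz2.open(path, "rt") as handle:
        return sum(1 for line in handle if line.startswith(">"))


# Example call (the file name depends on what was extracted from the tar):
# print(count_fasta_records("mpa_vOct22_CHOCOPhlAnSGB_202403_SGB.fna.bz2"))
```

The result can be compared directly with `len(db['markers'])` from the pickle.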
So, could you please check this tar file?
Hi @ZhaoHuiyao
I downloaded the tar file to check, and the mpa_vOct22_CHOCOPhlAnSGB_202403.fna.bz2 file has 5,843,065 sequences. That corresponds to the 5,751,328 markers present in the pkl file plus the viral sequences, which are not in the pickle (and shouldn’t be). Could you please repeat your download and let me know if the problem persists?
Hi, Claudia
I encountered a similar issue, and my counts are consistent with those reported by @zhaohuiyao. The number of sequences in “mpa_vOct22_CHOCOPhlAnSGB_202403.fna” differs significantly from the number of markers in “mpa_vOct22_CHOCOPhlAnSGB_202403_marker_info.txt”.
In addition, the IDs in “marker_info.txt” cannot be matched to the IDs in the “.fna” file. For example, “>319626__genemark-KZ824286.1-processed-gene-0.27-mRNA-1:cds_53157:53330__genemark-KZ824286.1-processed-gene-0.27-mRNA-1:cds_53157:53330 k__Eukaryota|p__Ascomycota|c__Eurotiomycetes|o__Eurotiales|f__Aspergillus|s__Aspergillus_homomorphus;GCA_003184865” cannot be found in “marker_info.txt”. Conversely, IDs related to EUK5691 (EUK5691__C9ZM76__TbgDal_IV4480) cannot be found in the “.fna” file.
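
A check of this kind can be sketched as a set comparison between the two files. The sketch rests on two assumptions that should be verified against the actual files: the sequence ID is the first whitespace-delimited token of each FASTA header, and the marker ID is the first tab-separated field of each line in “marker_info.txt”. All function names here are illustrative.

```python
import bz2


def open_text(path):
    """Open a plain or bz2-compressed text file for reading."""
    if str(path).endswith(".bz2"):
        return bz2.open(path, "rt")
    return open(path, "r")


def fasta_ids(path):
    """Collect the first token of every FASTA header (assumed to be the ID)."""
    ids = set()
    with open_text(path) as handle:
        for line in handle:
            if line.startswith(">"):
                fields = line[1:].split()
                if fields:
                    ids.add(fields[0])
    return ids


def marker_info_ids(path):
    """Collect the first tab-separated field of every non-empty line (assumed to be the ID)."""
    with open_text(path) as handle:
        return {line.split("\t", 1)[0].strip() for line in handle if line.strip()}


def compare_ids(fasta_path, info_path):
    """Return (IDs only in the FASTA file, IDs only in the marker info file)."""
    in_fasta = fasta_ids(fasta_path)
    in_info = marker_info_ids(info_path)
    return in_fasta - in_info, in_info - in_fasta
```

Printing the sizes of the two returned sets, along with a few example IDs from each, shows whether the mismatch is limited to a particular group of sequences.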


Based on this, I have the following questions:
- Under normal circumstances, does the “mpa_vOct22_CHOCOPhlAnSGB_202403.fna” contain the marker gene sequences for all species in the current database version?
- Should the “mpa_vOct22_CHOCOPhlAnSGB_202403_marker_info.txt” include the sequence IDs of all marker genes?
- The ID naming conventions between the two files seem inconsistent. The “.fna” file appears to have two formats: one with “Uniref ID” and “SGB ID” (UniRef90_UPI000E65A0B4|5__12|SGB32561), and another with “clade” and “GCA ID”. If I want to extract marker genes for a specific species, how should I proceed?
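
For the last question, one possible starting point is sketched below. It assumes that each entry in `db['markers']` has a `'taxon'` string holding a `|`-separated lineage that includes an `s__<species>` level, and that the marker IDs in the pickle match the first token of the FASTA headers. Both assumptions need checking against the real database, and the function names are hypothetical.

```python
import bz2


def markers_for_species(db, species):
    """Return marker IDs whose lineage contains the level 's__<species>'."""
    wanted = "s__" + species
    return {
        marker_id
        for marker_id, info in db["markers"].items()
        if wanted in info.get("taxon", "").split("|")
    }


def extract_records(fasta_path, wanted_ids, out_path):
    """Copy FASTA records whose ID is in wanted_ids; return how many were written."""
    opener = bz2.open if str(fasta_path).endswith(".bz2") else open
    written = 0
    keep = False
    with opener(fasta_path, "rt") as src, open(out_path, "w") as dst:
        for line in src:
            if line.startswith(">"):
                # Decide per record, based on the first token of the header
                fields = line[1:].split()
                keep = bool(fields) and fields[0] in wanted_ids
                if keep:
                    written += 1
            if keep:
                dst.write(line)
    return written
```

If the number of records written is smaller than the number of selected marker IDs, the two ID schemes do not line up for that species.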
Best,
Joshua