Following these guidelines:
import pickle
import bz2
db = pickle.load(bz2.open('metaphlan_databases/mpa_vJan21_CHOCOPhlAnSGB_202103.pkl', 'r'))
# Add the taxonomy of the new genomes
db['taxonomy']['7-levels taxonomy with clade names of genome1'] = ('7-levels NCBI taxonomy id of genome1', length of genome1)
db['taxonomy']['7-levels taxonomy with clade names of genome2'] = ('7-levels NCBI taxonomy id of genome2', length of genome2)
# Add the information of the new marker as the other markers
db['markers'][new_marker_name] = {
    'clade': the clade that the marker belongs to,
    'ext': {the GCA of the first external genome where the marker appears,
            the GCA of the second external genome where the marker appears,
           },
    'len': length of the marker,
    'taxon': the taxon of the marker
}
# To see an example, try to print the first marker information:
print(list(db['markers'].items())[0])
# Save the new mpa_pkl file
with bz2.BZ2File('metaphlan_databases/mpa_vJan21_CHOCOPhlAnSGB_NEW.pkl', 'w') as ofile:
    pickle.dump(db, ofile, pickle.HIGHEST_PROTOCOL)
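As a sanity check of the save step, here is a minimal round-trip sketch on a toy dictionary. It only assumes the layout shown in the guidelines above (a 'taxonomy' dict and a 'markers' dict); every name, ID, and length in `toy_db` is made up, and `round_trip` is my own helper, not part of MetaPhlAn.

```python
import bz2
import os
import pickle
import tempfile

# Toy stand-in for the real database; all names and numbers are invented.
toy_db = {
    "taxonomy": {
        "k__A|p__B|c__C|o__D|f__E|g__F|s__G": ("1|2|3|4|5|6|7", 1000),
    },
    "markers": {
        "toy_marker_1": {
            "clade": "s__G",
            "ext": {"GCA_000000001"},
            "len": 300,
            "taxon": "k__A|p__B|c__C|o__D|f__E|g__F|s__G",
        },
    },
}


def round_trip(db, path):
    """Write db as a bz2-compressed pickle, then read it back."""
    with bz2.BZ2File(path, "w") as ofile:
        pickle.dump(db, ofile, pickle.HIGHEST_PROTOCOL)
    with bz2.open(path, "rb") as ifile:
        return pickle.load(ifile)


with tempfile.TemporaryDirectory() as tmpdir:
    reloaded = round_trip(toy_db, os.path.join(tmpdir, "toy_mpa.pkl"))

# True if nothing was lost or altered on the way through disk.
print(reloaded == toy_db)
```

This says nothing about whether MetaPhlAn accepts the new entries; it only confirms that the file is written and read back intact.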
I am asking because many eukaryotes and viruses do not have this clearly defined 7-level structure:
echo "65574" | taxonkit lineage -R
65574 cellular organisms;Eukaryota;Sar;Rhizaria;Retaria;Acantharea no rank;superkingdom;clade;clade;clade;class
echo "335924" | taxonkit lineage -R
335924 Viruses;Duplodnaviria;Heunggongvirae;Uroviricota;Caudoviricetes;Myoviridae;unclassified Myoviridae;Cyanobacteria phage AS-1 superkingdom;clade;kingdom;phylum;class;family;no rank;species
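To make the gaps concrete, here is a rough sketch of what happens when these two lineages are forced into seven levels. The helper `to_seven_levels` is my own, not part of MetaPhlAn or TaxonKit. It assumes the seven levels correspond to the NCBI ranks superkingdom, phylum, class, order, family, genus, and species, written with `k__` to `s__` prefixes joined by `|`. The "unclassified" placeholder is purely illustrative; I do not know whether MetaPhlAn would accept it.

```python
# Assumed mapping from NCBI rank names to one-letter level prefixes.
RANK_TO_PREFIX = {
    "superkingdom": "k",
    "phylum": "p",
    "class": "c",
    "order": "o",
    "family": "f",
    "genus": "g",
    "species": "s",
}
PREFIX_ORDER = "kpcofgs"


def to_seven_levels(lineage, ranks, placeholder="unclassified"):
    """Return (taxonomy string, list of prefixes that had to be padded)."""
    names = lineage.split(";")
    rank_list = ranks.split(";")
    if len(names) != len(rank_list):
        raise ValueError("lineage and ranks must have the same number of fields")

    found = {}
    for name, rank in zip(names, rank_list):
        prefix = RANK_TO_PREFIX.get(rank)
        # Keep the first name seen for each level; skip clades and "no rank".
        if prefix is not None and prefix not in found:
            found[prefix] = name.replace(" ", "_")

    missing = [p for p in PREFIX_ORDER if p not in found]
    taxonomy = "|".join(f"{p}__{found.get(p, placeholder)}" for p in PREFIX_ORDER)
    return taxonomy, missing


acantharea = to_seven_levels(
    "cellular organisms;Eukaryota;Sar;Rhizaria;Retaria;Acantharea",
    "no rank;superkingdom;clade;clade;clade;class",
)
phage = to_seven_levels(
    "Viruses;Duplodnaviria;Heunggongvirae;Uroviricota;Caudoviricetes;"
    "Myoviridae;unclassified Myoviridae;Cyanobacteria phage AS-1",
    "superkingdom;clade;kingdom;phylum;class;family;no rank;species",
)

print(acantharea)
print(phage)
```

For Acantharea only the `k__` and `c__` levels get real names and the other five are padded; for the phage the `o__` and `g__` levels are padded.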
Is it possible to add genomes to the database (or build a custom database) when their taxonomy does not have all 7 levels defined?