I’d like to know what the difference is between the chocophlan database hosted here: https://bitbucket.org/biobakery/metaphlan2/downloads/ (this is 366 MB)
And the one that is downloaded via the command line:
humann2_databases --download chocophlan full $INSTALL_LOCATION
(this one is 5.5 GB)
I mean, other than the obvious difference in size, what is each one used for? They’re both called chocophlan, so I’m guessing they’re both used for running metaphlan.
ChocoPhlAn is the underlying pipeline that builds the species pan-genomes. From those pan-genomes it identifies the MetaPhlAn markers (the 366 MB file in the MetaPhlAn repository) as well as the HUMAnN centroids and functional annotation (the file retrieved with
humann2_databases --download chocophlan)
Thanks! Although I’m still not understanding.
I thought the pipeline was called metaphlan and chocophlan was the database (containing species-specific markers), or am I mistaken?
I thought the HUMAnN centroids and functional annotation were stored in the UniRef database.
As @fbeghini said, ChocoPhlAn is the pipeline that builds the pangenomes, and we often refer to the resulting pangenomes as the “ChocoPhlAn database.” The marker genes are a unique, conserved subset of each species’ pangenome, so in total they are a subset of ChocoPhlAn.
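To make the "unique, conserved subset" idea concrete, here is a toy Python sketch. It is not the ChocoPhlAn implementation: the function name `candidate_markers`, the `min_prevalence` parameter, and the example data are all made up for illustration, and the real pipeline works on sequences rather than on sets of gene-family labels. It only shows the selection logic: keep gene families found in (nearly) every genome of a species and in no genome of any other species.

```python
def candidate_markers(genomes_by_species, min_prevalence=1.0):
    """Toy marker selection (hypothetical helper, not part of any bioBakery tool).

    genomes_by_species maps a species name to a list of genomes, where each
    genome is a set of gene-family labels.
    """
    markers = {}
    for species, genomes in genomes_by_species.items():
        # Gene families seen in any genome of any other species.
        foreign = set()
        for other, other_genomes in genomes_by_species.items():
            if other != species:
                for genome in other_genomes:
                    foreign |= genome

        # The pangenome is the union of gene families across the species' genomes.
        pangenome = set().union(*genomes)

        # "Conserved": present in at least min_prevalence of the species' genomes.
        conserved = {
            family
            for family in pangenome
            if sum(family in genome for genome in genomes) / len(genomes) >= min_prevalence
        }

        # "Unique": not found in any other species.
        markers[species] = conserved - foreign
    return markers


# Made-up example data.
toy = {
    "species_A": [{"f1", "f2", "f3"}, {"f1", "f2", "f4"}],
    "species_B": [{"f2", "f5"}, {"f5", "f6"}],
}
result = candidate_markers(toy)
print(result)
```

In this toy data, `f2` is conserved in species_A but also appears in species_B, so it is not unique and is dropped. That is why the markers end up much smaller than the full pangenomes.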
UniRef is a clustering of the protein universe maintained and updated by UniProt. The ChocoPhlAn pangenomes were historically mapped against UniRef to identify broader gene families and known functional annotations (the modern ChocoPhlAn pipeline actually uses the UniRef clustering to aid in pangenome construction). Hope this clarifies the remaining confusion!
I’d like to know if any tools can be used for constructing a custom ChocoPhlAn database.
I have a set of 170,000 microbial reference genomes including bacteria, archaea, and fungi. Is there a pipeline (such as the ChocoPhlAn pipeline) available to general users that can extract markers from each genome and construct a custom marker-gene database? Thank you so much!