I’ve realized that I need full genomes and gene calls to build the HUMAnN database, which will include both prokaryotes and eukaryotes.
My question is: what is the minimum workflow to get a full-scale custom HUMAnN database? Do I need to build both a MetaPhlAn database and a ChocoPhlAn database? Are there any other databases I will need to build?
Thanks for suggesting this. I looked into it and it looks very useful! Unfortunately, it doesn’t create a MetaPhlAn database and it doesn’t work for eukaryotic organisms.
I’m going to write a wrapper to do this from genomes (along with GFF and/or protein files).
Regarding the MetaPhlAn genes, should I be using core markers, or could I just use anything with a UniRef hit?
Also, do I need to create a ChocoPhlAn db if I create a MetaPhlAn db? Or is that a dependency for creating a MetaPhlAn db?
Basically, I have a bunch of genomes with gene calls for a marine environment and I’d like to build out a custom marine db for HUMAnN and MetaPhlAn. It will be publicly available once it’s finished.
The MetaPhlAn marker database is a subset of the genes that get included in the ChocoPhlAn (pangenome) database. Specifically, the markers are genes within a pangenome that are core to the genomes in that pangenome (i.e. found in all of them) and unique to the genomes in that pangenome (i.e. not found in other pangenomes). In practice you might not get enough markers that are 100% core and 100% unique, so the goal is to have a few hundred that are as core and as unique as possible and then do a robust average over them.
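To make the "as core and as unique as possible" idea concrete, here is a minimal Python sketch. It is not the actual MetaPhlAn or ChocoPhlAn build code. It assumes you already have gene-family presence per genome (e.g. from your own clustering), and the function names, the 1/(number of species) uniqueness score, the thresholds, and the trimmed mean used as the "robust average" are all illustrative assumptions.

```python
def score_families(pangenomes):
    """Score every gene family in every species pangenome.

    pangenomes: {species: {genome_id: set(family_ids)}}  (assumed input layout)
    Returns {species: {family: (coreness, uniqueness)}} where
      coreness   = fraction of the species' genomes carrying the family
      uniqueness = 1 / number of species carrying the family (toy definition)
    """
    # Record which species carry each family in at least one genome.
    species_with = {}
    for species, genomes in pangenomes.items():
        for families in genomes.values():
            for fam in families:
                species_with.setdefault(fam, set()).add(species)

    scores = {}
    for species, genomes in pangenomes.items():
        n_genomes = len(genomes)
        counts = {}
        for families in genomes.values():
            for fam in families:
                counts[fam] = counts.get(fam, 0) + 1
        scores[species] = {
            fam: (count / n_genomes, 1.0 / len(species_with[fam]))
            for fam, count in counts.items()
        }
    return scores


def pick_markers(family_scores, max_markers=200, min_core=0.9, min_unique=1.0):
    """Keep the best-scoring families above (hypothetical) thresholds."""
    ranked = [
        (core * unique, fam)
        for fam, (core, unique) in family_scores.items()
        if core >= min_core and unique >= min_unique
    ]
    # Best combined score first; family id breaks ties deterministically.
    ranked.sort(key=lambda item: (-item[0], item[1]))
    return [fam for _, fam in ranked[:max_markers]]


def robust_abundance(marker_coverages, trim=0.1):
    """Trimmed mean over per-marker coverages (one possible robust average)."""
    values = sorted(marker_coverages)
    k = int(len(values) * trim)
    kept = values[k:len(values) - k] if len(values) > 2 * k else values
    return sum(kept) / len(kept)


if __name__ == "__main__":
    toy = {
        "spA": {"g1": {"f1", "f2", "f3"}, "g2": {"f1", "f2"}},
        "spB": {"g3": {"f2", "f4"}},
    }
    scores = score_families(toy)
    print(pick_markers(scores["spA"]))
    print(robust_abundance([1, 10, 10, 10, 10, 10, 10, 10, 10, 100]))
```

If too few families pass, loosening min_core or min_unique is the knob that corresponds to accepting markers that are less than 100% core or 100% unique.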
This is really helpful. For a general workflow, would you recommend clustering the genomes and then finding conserved proteins within each cluster?
How do you balance including/excluding a genome against including/excluding a protein? For example, what if one genome is lacking a bunch of the proteins, and without it you would get way more conserved proteins in the pangenome?
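To illustrate the trade-off I mean, here is a rough Python sketch (my own toy code, not anything from the bioBakery tools). It assumes a mapping from genome id to the set of protein-cluster ids found in that genome; the function names and the min_fraction parameter are made up for illustration. It reports how many extra core clusters you would gain by dropping each genome in turn.

```python
def core_families(genomes, min_fraction=1.0):
    """Clusters present in at least min_fraction of the genomes.

    genomes: {genome_id: set(cluster_ids)}  (assumed input layout)
    """
    n_genomes = len(genomes)
    counts = {}
    for clusters in genomes.values():
        for cluster in clusters:
            counts[cluster] = counts.get(cluster, 0) + 1
    return {c for c, count in counts.items() if count / n_genomes >= min_fraction}


def leave_one_out_gain(genomes, min_fraction=1.0):
    """For each genome, the change in core-cluster count if it is excluded."""
    baseline = len(core_families(genomes, min_fraction))
    gains = {}
    for genome_id in genomes:
        rest = {g: c for g, c in genomes.items() if g != genome_id}
        gains[genome_id] = len(core_families(rest, min_fraction)) - baseline
    return gains


if __name__ == "__main__":
    toy = {
        "g1": {"a", "b", "c", "d"},
        "g2": {"a", "b", "c", "d"},
        "g3": {"a"},  # incomplete genome that shrinks the strict core
    }
    print(leave_one_out_gain(toy))
    # Alternative to dropping g3: relax the coreness threshold instead.
    print(sorted(core_families(toy, min_fraction=0.6)))
```

A genome with a large gain is a candidate for exclusion (or a sign it is incomplete), while lowering min_fraction keeps the genome and tolerates the missing proteins instead.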
That is getting into some very interesting (but deep!) questions about how to reconstruct and markerize species pangenomes. I recommend checking out the methods of these two papers for our current recommendations in this area:
This is going to be tricky. I’ve clustered all of the proteins within each species cluster, but I guess I’ll need to cluster those representatives across species clusters to see if they are unique to their cluster.
I have a few follow-up questions:
Is there any code available for how the default HUMAnN, MetaPhlAn, and ChocoPhlAn databases were created? I’ve seen this post but there weren’t any responses: Chocophlan source code
Is it preferred to have a MetaPhlAn and ChocoPhlAn database when running HUMAnN, or can you get comparable results using just the proteins?
Should we expect to have a 1-to-1 relationship between the protein and nucleotide sequences?
I’d like to get started on this, but I’m a little confused about where exactly to start and which resources to follow to generate a fully operational custom HUMAnN and MetaPhlAn database.