What is needed to build a custom HUMAnN database?

Finally circling around to this.

I’m taking a look here: GitHub - biobakery/humann: HUMAnN is the next generation of HUMAnN 1.0 (HMP Unified Metabolic Analysis Network).

The MetaPhlAn marker database is a subset of the genes that get included in the ChocoPhlAn (pangenome) database. Specifically, the markers are genes within a pangenome that are core to the genomes in that pangenome (i.e. found in all of them) and unique to the genomes in that pangenome (i.e. not found in other pangenomes). In practice you might not get enough markers that are 100% core and 100% unique, so the goal is to have a few hundred that are as core and as unique as possible and then do a robust average over them.

This is going to be tricky. I’ve clustered all of the proteins within each species cluster, but I guess I’ll also need to cluster those representatives across species clusters to see whether each one is unique to its own cluster.
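To make the core/unique idea concrete, here is a rough Python sketch of how candidate markers could be scored once the cross-cluster clustering is done. It is not how the official MetaPhlAn or ChocoPhlAn databases were built; the function name `rank_marker_candidates`, the input layout, and the "coreness × uniqueness" ranking are all assumptions made for illustration. It assumes every protein has already been assigned to a gene family and that each genome belongs to exactly one pangenome (species cluster).

```python
from collections import Counter, defaultdict


def rank_marker_candidates(family_genomes, genome_pangenome, top_n=200):
    """Rank gene families as marker candidates for each pangenome.

    family_genomes:   dict of family id -> set of genome ids carrying it
    genome_pangenome: dict of genome id -> pangenome (species cluster) id
    top_n:            how many candidates to keep per pangenome (hypothetical default)

    Returns a dict of pangenome id -> list of (family, coreness, uniqueness),
    best candidates first.
    """
    # Number of genomes in each pangenome, used as the coreness denominator.
    pangenome_sizes = Counter(genome_pangenome.values())

    ranked = defaultdict(list)
    for family, genomes in family_genomes.items():
        # How many genomes of each pangenome carry this family.
        hits = Counter(genome_pangenome[g] for g in genomes)
        total = sum(hits.values())
        for pangenome, n in hits.items():
            # Coreness: fraction of the pangenome's genomes with the family.
            coreness = n / pangenome_sizes[pangenome]
            # Uniqueness: fraction of the family's genomes in this pangenome.
            uniqueness = n / total
            ranked[pangenome].append((family, coreness, uniqueness))

    # Assumed ranking: product of the two scores, highest first.
    return {
        pangenome: sorted(rows, key=lambda r: r[1] * r[2], reverse=True)[:top_n]
        for pangenome, rows in ranked.items()
    }
```

With a table like this, a family scoring 1.0 on both measures is a perfect marker, and the top few hundred per pangenome would be the "as core and as unique as possible" set described above. The robust averaging step over those markers is not shown here.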

I have a few follow-up questions:

  • Is there any code available showing how the default HUMAnN, MetaPhlAn, and ChocoPhlAn databases were created? I’ve seen this post, but there weren’t any responses: Chocophlan source code

  • Is it preferred to have both a MetaPhlAn and a ChocoPhlAn database when running HUMAnN, or can you get comparable results using just the proteins?

  • Should we expect to have a 1-to-1 relationship between the protein and nucleotide sequences?

I’d like to get started on this, but I’m a little unsure where exactly to start and which resources to follow to generate a fully operational custom HUMAnN and MetaPhlAn database.