What’s needed to build a custom HUMAnN database?

jolespin · July 2, 2023, 9:02pm

I’m trying to follow along here:

I’ve realized I need full genomes and gene calls to build the HUMAnN database, which will include prokaryotes and eukaryotes.

My question is: what is the minimum workflow to get a full-scale custom HUMAnN database? Do I need to build both a MetaPhlAn database and a ChocoPhlAn database? Are there any other databases I will need to build?

franzosa · July 7, 2023, 6:43pm

There’s some recent discussion about this here:

jolespin · July 18, 2023, 11:26pm

Thanks for suggesting this. I looked into it and it looks very useful! Unfortunately, they don’t create a MetaPhlAn database and it doesn’t work for eukaryotic organisms.

I’m going to write a wrapper to do this from genomes (along with GFF and/or proteins).

Regarding the MetaPhlAn genes, should I be using core markers, or could I just use anything with a UniRef hit?

Also, do I need to create a ChocoPhlAn db if I create a MetaPhlAn db? Or is it a dependency for creating a MetaPhlAn db?

Basically, I have a bunch of genomes with gene calls for a marine environment and I’d like to build out a custom marine db for HUMAnN and MetaPhlAn. It will be publicly available once it’s finished.

franzosa · July 20, 2023, 7:52pm

The MetaPhlAn marker database is a subset of the genes that get included in the ChocoPhlAn (pangenome) database. Specifically, the markers are genes within a pangenome that are core to the genomes in that pangenome (i.e. found in all of them) and unique to the genomes in that pangenome (i.e. not found in other pangenomes). In practice you might not get enough markers that are 100% core and 100% unique, so the goal is to have a few hundred that are as core and as unique as possible and then do a robust average over them.
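To make the "as core and as unique as possible" idea concrete, here is a minimal Python sketch. It is not the code used to build the official MetaPhlAn markers. The input layout (`family_genomes`, `genomes_by_pangenome`), the function names, the thresholds, and the use of a trimmed mean as the "robust average" are all assumptions made for illustration.

```python
def score_families(family_genomes, genomes_by_pangenome, target):
    """Return {family: (coreness, uniqueness)} for one target pangenome.

    family_genomes: {family_id: set of genome ids carrying that gene family}
    genomes_by_pangenome: {pangenome_id: set of genome ids}
    (Both layouts are hypothetical; adapt them to your clustering output.)
    """
    inside = genomes_by_pangenome[target]
    outside = set().union(
        *(g for name, g in genomes_by_pangenome.items() if name != target)
    )
    scores = {}
    for family, carriers in family_genomes.items():
        n_inside = len(carriers & inside)
        if n_inside == 0:
            continue  # family is absent from the target pangenome
        # Coreness: fraction of the target's genomes that carry the family.
        coreness = n_inside / len(inside)
        # Uniqueness: fraction of all other genomes that lack the family.
        hit_rate = len(carriers & outside) / len(outside) if outside else 0.0
        scores[family] = (coreness, 1.0 - hit_rate)
    return scores


def pick_markers(scores, max_markers=200, min_coreness=0.5, min_uniqueness=0.9):
    """Keep families passing both (illustrative) thresholds, best first."""
    kept = [
        (family, c, u)
        for family, (c, u) in scores.items()
        if c >= min_coreness and u >= min_uniqueness
    ]
    kept.sort(key=lambda item: (-(item[1] * item[2]), item[0]))
    return [family for family, _, _ in kept[:max_markers]]


def trimmed_mean(values, trim=0.2):
    """One possible robust average: drop the lowest and highest `trim` fraction."""
    if not values:
        raise ValueError("need at least one value")
    ordered = sorted(values)
    k = int(len(ordered) * trim)
    middle = ordered[k:len(ordered) - k] or ordered
    return sum(middle) / len(middle)


genomes_by_pangenome = {"sp1": {"g1", "g2", "g3"}, "sp2": {"g4", "g5"}}
family_genomes = {
    "famA": {"g1", "g2", "g3"},  # core and unique to sp1
    "famB": {"g1", "g2", "g4"},  # partly core, also found in sp2
    "famC": {"g4", "g5"},        # absent from sp1
}
scores = score_families(family_genomes, genomes_by_pangenome, "sp1")
markers = pick_markers(scores)
```

With this toy input only `famA` passes both thresholds for `sp1`. A species' abundance could then be summarized as `trimmed_mean` over the per-marker coverages, so that a handful of badly behaved markers does not dominate the estimate.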

jolespin · August 18, 2023, 4:13pm

This is really helpful. For a general workflow, would you recommend clustering the genomes and then finding conserved proteins within each cluster?

How do you balance between including/excluding a genome vs including/excluding a protein? For example, what if one genome is lacking a bunch of the proteins, and without it you would get way more conserved proteins in the pangenome?
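One way to put numbers on that trade-off is a leave-one-out check: count the core families with all genomes, then recount with each genome removed in turn. The sketch below is only an illustration. The `family_genomes` layout, the function names, and the `min_fraction` threshold are assumptions, not part of any bioBakery tool.

```python
def core_families(family_genomes, genomes, min_fraction=1.0):
    """Families carried by at least `min_fraction` of the given genomes."""
    genomes = set(genomes)
    if not genomes:
        return set()
    return {
        family
        for family, carriers in family_genomes.items()
        if len(carriers & genomes) >= min_fraction * len(genomes)
    }


def leave_one_out_gain(family_genomes, genomes, min_fraction=1.0):
    """For each genome, how many core families are gained by dropping it."""
    genomes = set(genomes)
    if len(genomes) < 2:
        return {}
    baseline = len(core_families(family_genomes, genomes, min_fraction))
    return {
        genome: len(core_families(family_genomes, genomes - {genome}, min_fraction))
        - baseline
        for genome in sorted(genomes)
    }


# Toy species cluster: g4 is missing two families that the others share.
genomes = {"g1", "g2", "g3", "g4"}
family_genomes = {
    "famA": {"g1", "g2", "g3", "g4"},
    "famB": {"g1", "g2", "g3"},
    "famC": {"g1", "g2", "g3"},
    "famD": {"g1", "g2"},
}
gains = leave_one_out_gain(family_genomes, genomes)
relaxed_core = core_families(family_genomes, genomes, min_fraction=0.75)
```

Here dropping `g4` adds two core families while dropping any other genome adds none, which flags `g4` as the outlier. The alternative is to keep every genome and relax the coreness threshold: with `min_fraction=0.75` the same three families count as core. Which option is appropriate (for example, whether `g4` is an incomplete assembly or a genuinely divergent strain) is a judgement call this sketch does not make.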

franzosa · August 18, 2023, 6:41pm

That is getting into some very interesting (but deep!) questions about how to reconstruct and markerize species pangenomes. I recommend checking out the methods of these two papers for our current recommendations in this area:

jolespin · August 30, 2023, 10:14pm

Finally circling back to this.

I’m taking a look here: GitHub - biobakery/humann: HUMAnN is the next generation of HUMAnN 1.0 (HMP Unified Metabolic Analysis Network).

The MetaPhlAn marker database is a subset of the genes that get included in the ChocoPhlAn (pangenome) database. Specifically, the markers are genes within a pangenome that are core to the genomes in that pangenome (i.e. found in all of them) and unique to the genomes in that pangenome (i.e. not found in other pangenomes). In practice you might not get enough markers that are 100% core and 100% unique, so the goal is to have a few hundred that are as core and as unique as possible and then do a robust average over them.

This is going to be tricky. I’ve clustered all of the proteins within each species cluster, but I guess I’ll need to cluster those representatives to see if they are unique to the cluster.

I have a few follow-up questions:

Is there any code available for how the default HUMAnN, MetaPhlAn, and ChocoPhlAn databases were created? I’ve seen this post but there weren’t any responses: Chocophlan source code
Is it preferred to have a MetaPhlAn and ChocoPhlAn database when running HUMAnN, or can you get comparable results using just the proteins?
Should we expect to have a 1-to-1 relationship between the protein and nucleotide sequences?
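Whatever the answer to the last question, it is easy to check whether a given set of gene calls is 1-to-1. The sketch below is a minimal Python check. It assumes the protein and nucleotide FASTA files use identical record IDs (the first whitespace-delimited token of each header) and that each nucleotide record is a complete coding sequence, with or without the stop codon. Both assumptions depend on the gene caller, and the function names are hypothetical.

```python
def read_fasta(lines):
    """Parse FASTA lines into {record_id: sequence}; the id is the first header token."""
    records = {}
    name = None
    chunks = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                records[name] = "".join(chunks)
            header = line[1:].split()
            name = header[0] if header else ""
            chunks = []
        else:
            chunks.append(line)
    if name is not None:
        records[name] = "".join(chunks)
    return records


def compare_calls(proteins, genes):
    """Report ids missing on either side and pairs whose lengths disagree."""
    protein_ids, gene_ids = set(proteins), set(genes)
    mismatched = []
    for record_id in sorted(protein_ids & gene_ids):
        n_residues = len(proteins[record_id].rstrip("*"))
        # Accept the CDS with or without a trailing stop codon.
        if len(genes[record_id]) not in (3 * n_residues, 3 * n_residues + 3):
            mismatched.append(record_id)
    return {
        "protein_only": sorted(protein_ids - gene_ids),
        "gene_only": sorted(gene_ids - protein_ids),
        "length_mismatch": mismatched,
    }


protein_text = ">p1 some description\nMKV*\n>p2\nMA\n>p3\nMK\n"
gene_text = ">p1\nATGAAAGTT\nTAA\n>p2\nATGGCTTAAA\n>p4\nATG\n"
report = compare_calls(
    read_fasta(protein_text.splitlines()),
    read_fasta(gene_text.splitlines()),
)
```

For real files, pass open file handles to `read_fasta` instead of `splitlines()`. In the toy input, `p3` has no nucleotide record, `p4` has no protein record, and `p2` has a nucleotide length that does not match its protein, so all three are reported. Partial genes at contig edges may legitimately fail the length test.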

I’d like to get started on this, but I’m a little confused about where exactly to start and which resources to follow to generate a fully operational custom HUMAnN and MetaPhlAn database.
