Thoughts on custom humann3 reference databases


I was hoping you could give me some general feedback regarding the use of custom humann3 reference databases:

My idea is to use a recently published collection of mouse gastrointestinal bacterial genomes to generate custom nucleotide and translated reference databases for humann3 (see the BenBeresfordJones/MGBC repository on GitHub, "The Mouse Gastrointestinal Bacteria Catalogue", and the linked publication for more info). This collection contains ~26,400 high-quality bacterial genomes assembled from mouse gut metagenomes. Using the available kraken/bracken databases generated from this collection, I get very complete species-level taxonomic classification of my sample reads (~95%, compared to 30-60% with NCBI).

I was thinking that making custom databases for humann3 from this collection might similarly enable much more complete mapping of my sample reads to functional units (genes/reactions/pathways) and would make it easier to compare the humann3 outputs to my kraken/bracken taxonomic profiles. I think that once I have custom nucleotide and translated databases, I could adapt the bracken outputs and use them as the taxonomic profile, bypassing metaphlan.

However, I am very new to computational analysis and am still not 100% sure whether doing this makes sense. I am also a bit overwhelmed trying to figure out the necessary steps, and I don’t have a good feel for how big a job this might be in terms of computational resources.

I was hoping to get some feedback on the usefulness (given my goal of more complete mapping of reads) and feasibility of making these custom databases. Also, if you have any advice or can point me to any relevant resources, I would be very grateful!



I think it would make sense. We also work with kraken/bracken and have managed to provide HUMAnN with a custom taxonomy list for the nucleotide alignment step. Creating a custom subset of the ChocoPhlAn database from our large taxonomic list is pretty straightforward: it requires KrakenTools and some basic bash commands to trick HUMAnN into thinking the list was produced by MetaPhlAn. The procedure would be the same if you provided a custom nucleotide reference database. I can share our code if needed.
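To give an idea of what that reformatting step involves, here is a minimal Python sketch (not our actual code). It assumes you have already converted a Kraken-style report to MetaPhlAn-style lines with KrakenTools (kreport2mpa.py), giving one tab-separated "clade path, read count" row per taxon for a single sample. The function names and the header line are hypothetical placeholders: the exact header and columns HUMAnN expects in a taxonomic profile depend on the HUMAnN/MetaPhlAn version, so check them against a real MetaPhlAn output before use.

```python
from typing import Dict, Iterable, List


def species_relative_abundance(mpa_lines: Iterable[str]) -> Dict[str, float]:
    """Return {full clade path: percent} using species-level rows only."""
    counts: Dict[str, float] = {}
    for line in mpa_lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines and header/comment lines
        fields = line.split("\t")
        clade, value = fields[0], fields[1]
        # Keep only rows whose deepest rank is a species (s__...).
        if not clade.split("|")[-1].startswith("s__"):
            continue
        counts[clade] = counts.get(clade, 0.0) + float(value)
    total = sum(counts.values())
    if total == 0:
        return {}
    # Renormalise so that species-level abundances sum to 100 percent.
    return {clade: 100.0 * n / total for clade, n in counts.items()}


def to_profile_lines(
    abundance: Dict[str, float],
    header: str = "#clade_name\trelative_abundance",  # placeholder header (assumption)
) -> List[str]:
    """Format the abundances as tab-separated lines, most abundant first."""
    lines = [header]
    for clade, pct in sorted(abundance.items(), key=lambda kv: -kv[1]):
        lines.append(f"{clade}\t{pct:.5f}")
    return lines
```

The resulting file would then be passed to HUMAnN through its --taxonomic-profile option. The species names in it have to match the names used in the nucleotide database, otherwise no pangenomes will be selected.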

When it comes to custom database creation, the nt and prot databases must fit certain criteria. I recommend reading this section of the humann wiki (and the two below).
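As a rough illustration of the kind of criteria involved, the sketch below rewrites nucleotide FASTA headers into a "gene_family|gene_length|taxonomy" layout. That layout is my reading of the custom annotation format described in the HUMAnN manual, so treat it as an assumption and confirm it for your version. The function names and the taxonomy string are hypothetical, and the sketch only covers nucleotide sequences (the length is simply the sequence length in bases).

```python
from typing import Iterable, Iterator, Optional, Tuple


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header without '>', sequence) pairs from FASTA lines."""
    header: Optional[str] = None
    chunks = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def annotate_headers(
    records: Iterable[Tuple[str, str]],
    taxonomy: Optional[str] = None,  # e.g. "g__Genus.s__Genus_species" (assumed format)
) -> Iterator[Tuple[str, str]]:
    """Yield (new header line, sequence) using an 'id|length[|taxonomy]' layout."""
    for header, seq in records:
        gene_id = header.split()[0]  # drop any free-text description
        parts = [gene_id, str(len(seq))]  # nucleotide length in bases
        if taxonomy:
            parts.append(taxonomy)
        yield ">" + "|".join(parts), seq
```

In practice the gene identifier would be replaced by (or mapped to) a gene family such as a UniRef ID, otherwise the downstream pathway steps have nothing to work with. As far as I can tell, the manual also describes an --id-mapping option for supplying that mapping in a separate file instead of editing the headers.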

In our case, we want our protein database to focus on certain biogeochemical reactions, so we plan to provide a custom database for the translated (DIAMOND) search. We are particularly interested in the NCyc database, which already seems to be formatted for DIAMOND, and we will be testing it shortly. A smaller database should also speed up the translated search; currently, the alignment alone takes about 8 h on ~9-15 Gb samples using 24 cores and 30 GB of memory.

In any case, any input from people with more experience than I have would be much appreciated!

