I was hoping you could give me some general feedback regarding the use of custom humann3 reference databases:
My idea is to use a recently published collection of mouse gastrointestinal bacterial genomes to generate custom nucleotide and translated reference databases for humann3 (see GitHub - BenBeresfordJones/MGBC: The Mouse Gastrointestinal Bacteria Catalogue and the linked publication for more info). This collection contains ~26,400 high-quality bacterial genomes assembled from mouse gut metagenomes. Using the available kraken/bracken databases generated from this collection, I get very complete species-level taxonomic classification of my sample reads (~95% of reads classified, compared to 30-60% with NCBI).
I was thinking that making custom databases for humann3 from this collection might similarly enable much more complete mapping of my sample reads to functional units (genes/reactions/pathways), and would make it easier to compare the humann3 outputs to my kraken/bracken taxonomic profiles. I think that once I have custom nucleotide and translated databases, I could adjust the bracken outputs and use them as the taxonomic profile, bypassing metaphlan.
However, I am very new to computational analysis and am still not 100% sure whether doing this makes sense. I am also a bit overwhelmed trying to figure out the necessary steps, and I don’t have a good feel for how big a job this might be in terms of computational resources.
I was hoping to get some feedback on the usefulness (given my goal of more complete mapping of reads) and feasibility of making these custom databases. Also, if you have any advice or can point me to any relevant resources, I would be very grateful!
I think it would make sense. We also work with kraken/bracken and have managed to provide Humann with a custom taxonomy list for the nucleotide alignment part. When it comes to creating a custom subset of the chocophlan database using our large taxonomic list, the procedure is pretty straightforward (requires KrakenTools and some basic bash commands to trick Humann into thinking it’s a list produced by metaphlan). It would be the same if you provided a custom nt reference database. I can share our code if needed.
When it comes to custom database creation, the nucleotide and protein databases must meet certain criteria. I recommend reading this section of the humann wiki (and the two sections below it).
In our case, we want our protein database to focus on certain biogeochemical reactions, hence the need to provide a custom database for the translated (DIAMOND) search. We are particularly interested in the NCyc database, which already seems to be formatted for DIAMOND. We are going to be testing this shortly. It should also accelerate the translated search; currently, the alignment alone takes about 8 h on ~9-15 GB samples using 24 cores and 30 GB of memory.
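In case it helps, this is roughly what we plan to try. Treat it as a sketch under assumptions rather than a tested recipe: the file names and paths are placeholders, and the requirements HUMAnN puts on custom protein databases (how hits are mapped back to functional annotations) are the ones described in the wiki section linked above.

```bash
# Sketch only: placeholders throughout; check the humann wiki for the
# annotation/ID requirements of custom protein databases before relying on this.

# 1. Build a DIAMOND database from the NCyc protein fasta (if not already .dmnd)
diamond makedb --in NCyc_proteins.faa --db ncyc    # writes ncyc.dmnd

# 2. Put the .dmnd file in its own directory and point HUMAnN at it
#    for the translated search
mkdir -p custom_protein_db && mv ncyc.dmnd custom_protein_db/

humann \
  --input sample_concat.fastq.gz \
  --output humann_out/ \
  --protein-database custom_protein_db/ \
  --threads 24
```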
In any case, any input from people with more experience than I would be much appreciated!
Hello,
Thanks for all the explanations. I have a question: I am converting kraken reports to mpa format with KrakenTools. Do you know which additional modifications are needed to make it compatible with Humann? Is it a specific header?
HUMAnN looks for both the MetaPhlAn 4-column format and a version-specific header to make sure that your taxonomic profile and pangenomes are compatible. You would have to “hack” both of these attributes to trick HUMAnN into using a taxonomic profile from another method.
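For illustration only (the exact strings differ between MetaPhlAn/database versions, so copy them from a real MetaPhlAn profile generated with the matching database rather than from here), the top of a profile HUMAnN will accept looks roughly like this:

```bash
head -n 4 metaphlan_profile.tsv
# #mpa_vXX_CHOCOPhlAn_XXXXXX        <- version-specific header HUMAnN checks
# #clade_name   NCBI_tax_id   relative_abundance   additional_species
# k__Bacteria   2             99.85
# k__Bacteria|p__Firmicutes   2|1239   64.32
```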
And to also offer some reply to the original question, I think creating an entire alternate data ecosystem for HUMAnN would be a big undertaking (it’s a big job for us just to create and maintain the default ones). I would not recommend it as a project to someone who is just getting started in computational analysis. Note that HUMAnN’s data ecosystem involves more than the genome/pangenome files themselves - there are also a lot of important annotation files that go along with the sequence data, and without those you’d have mapping results but they wouldn’t be very interpretable.
@nbat64
I use the following bash script to hack the formatting (starting from the Bracken output, which I recommend you run on your Kraken output).
Kraken is known to output a ton of ultra-low-abundance species (presumably many false positives, a consequence of its non-greedy approach to k-mer classification; see Sourmash for an alternative to k-mer analysis of metagenomes). Keeping all of these is not very useful in the context of Humann, since the profile would barely subset the Chocophlan index and you would end up rebuilding most of the database anyway. This script keeps only the top 20% of species (that may sound like a harsh cutoff, but Kraken outputs a LOT of ultra-low-abundance species). In our case, Kraken reports something like 7,000 species study-wide, where Metaphlan, for example, would report fewer than 1,000; using all of them would pull in nearly half the Chocophlan db.
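In outline, it does something like this (a simplified sketch: the column layout assumes the 2-column output of KrakenTools’ kreport2mpa.py run on a Bracken-corrected report, and the header string is a placeholder you should copy from a real Metaphlan profile matching your Chocophlan version):

```bash
#!/usr/bin/env bash
# Simplified sketch -- adapt file names and the header string to your setup.
in=sample_bracken.mpa.txt        # 2-column output of kreport2mpa.py
out=sample_bugs_list.tsv

# keep species-level lines only and sort by abundance (column 2, descending)
grep "s__" "$in" | sort -t$'\t' -k2,2 -nr > species_sorted.tmp

# keep only the top 20% most abundant species
n=$(wc -l < species_sorted.tmp)
keep=$(( (n + 4) / 5 ))          # ceil(n / 5)

{
  # placeholder for the version-specific Metaphlan header Humann expects
  echo "#mpa_vXX_CHOCOPhlAn_XXXXXX"
  head -n "$keep" species_sorted.tmp
} > "$out"

rm species_sorted.tmp
```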
Then you can use combine_mpa.py to create a study-wide table, and use this to build one custom reference subset database (do it once for the whole study, not once per sample!). Then supply Humann with the bugs list via the --taxonomic-profile option and the path to the custom db via the --nucleotide-database option.
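Roughly, those last steps look like this (again a sketch: file names and paths are placeholders, and depending on how combine_mpa.py lays out its columns you may need to massage the merged table back into a single-profile format before Humann accepts it):

```bash
# merge the per-sample bugs lists into one study-wide table (KrakenTools)
combine_mpa.py -i sample1_bugs_list.tsv sample2_bugs_list.tsv -o study_bugs_list.tsv

# run Humann with the study-wide profile and the custom Chocophlan subset
humann \
  --input sample1_concat.fastq.gz \
  --output humann_out/sample1 \
  --taxonomic-profile study_bugs_list.tsv \
  --nucleotide-database /path/to/custom_chocophlan_subset/ \
  --threads 24
```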
@jorondo1
Hi, thank you very much for the detailed explanation. Yes, I am using Bracken too (with parameters -t 10 for abundance estimation).
So with your script you create an mpa file with only 2 columns instead of 4; is that OK? (And just to be sure: there is no need for the --percentages option of kreport2mpa.py, since the inputs are Bracken outputs?)
For Humann analyses, can we also filter out all non-bacterial species (eukaryotes, viruses, etc.)?