Building my own reduced database for AMPs, virulence factors and others

First of all, thanks for this amazing tool! The species stratifications are great.

What I am wondering about is whether it is possible to build a custom database to screen against, as one can do with ShortBRED. In particular, I am interested in antimicrobial peptides (AMPs), secondary metabolites, virulence factors and antibiotic resistance genes. They should also be included in UniRef90, but when using such comprehensive databases I get the feeling that the results I am interested in are noisy because of far more abundant genes, such as those for peptidoglycan synthesis.

Best

Philipp

The concern is more so that, when you’re doing really broad functional profiling, you run the risk of losing some specificity (i.e. saying proteins of interest are present when they are not). HUMAnN does a lot of internal QC to maximize specificity, but there are some limits to what it can do while also focusing on the entire protein universe. ShortBRED is only focusing on a small number of proteins at a time, so it can do a lot more upstream work to try to identify the best subsequences to look for to identify and quantify those proteins.

Notably, using a more restricted database in HUMAnN would not necessarily solve this problem: off-target reads from your sample could still map to the smaller database and give the appearance that those proteins are present when in fact they are not. For example, imagine you had a protein X with domains A and B and you made a database of just that protein. If your sample contains other proteins with homologous A and B domain sequences, the X protein might recruit their reads and appear to be present even if it isn't.
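To make the protein X example concrete, here is a minimal toy sketch in Python. Everything in it is made up for illustration: the sequences, the 8-residue "reads" and the exact-substring "mapping" are stand-ins and not how HUMAnN or ShortBRED actually align reads. It only shows that reads from two other proteins that share domains A and B get recruited to X when X is the whole database, while a subsequence unique to X recruits nothing.

```python
# Toy illustration only: hypothetical sequences, exact substring matching
# instead of real alignment, peptide fragments instead of nucleotide reads.
DOMAIN_A = "MKTLLVAGGS"   # hypothetical shared domain A
DOMAIN_B = "GHWDERTYPL"   # hypothetical shared domain B
UNIQUE_X = "QQNNCCFFYY"   # hypothetical region found only in protein X

# Protein X is the only entry in the reduced database.
protein_x = DOMAIN_A + UNIQUE_X + DOMAIN_B

# Two other proteins that are really in the sample; each shares one domain with X.
protein_y = "AAAAAPPPPP" + DOMAIN_A + "DDDDDEEEEE"
protein_z = "VVVVVIIIII" + DOMAIN_B + "RRRRRKKKKK"


def reads_from(seq, k=8):
    """Return every overlapping fragment of length k from seq."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def count_hits(reads, reference):
    """Count reads that occur exactly somewhere in the reference sequence."""
    return sum(1 for read in reads if read in reference)


# The sample contains Y and Z, but NOT protein X.
sample_reads = reads_from(protein_y) + reads_from(protein_z)

# Database = full-length X: domain reads from Y and Z are recruited.
hits_full_length = count_hits(sample_reads, protein_x)

# Database = only the region unique to X: nothing is recruited.
hits_unique_region = count_hits(sample_reads, UNIQUE_X)

print("hits against full-length X:", hits_full_length)
print("hits against unique region of X:", hits_unique_region)
```

In this toy setup the full-length X collects hits even though X is absent, while the unique region collects none, which is the intuition behind searching for carefully chosen subsequences rather than whole proteins.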

Thanks for your detailed explanation @franzosa

So you would recommend using HUMAnN only when screening against a comprehensive database, because the trade-off between specificity and power is then better than when screening against a reduced database?

My next best guess, in order to keep the stratifications that HUMAnN outputs, would be to run a prescreen in ShortBRED and then use the proteins found to be present as a search database in HUMAnN.

The tools (HUMAnN and ShortBRED) are really geared toward separate questions/goals, and their databases and algorithms reflect those differences.

HUMAnN = “I want to assign as many reads in my sample to candidate functions as possible. Coverage is more important to me than being 100% specific.”

ShortBRED = “Do a small number of functions of interest occur in my samples, and if so, in what quantities? I really need to avoid false positives as much as possible.”

If using both tools, I would guess the progression would be more like “Find an interesting pattern for function X in the HUMAnN output” → “Confirm that pattern using ShortBRED.” That assumes you’re starting your analysis without a specific hypothesis about function X; if you are interested in function X going in then I would just start with a ShortBRED approach.
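For the first step of that progression, one option is to pull only the families of interest out of a full HUMAnN gene families table, instead of rebuilding the database. The following is a minimal sketch, assuming a tab-separated table with a `#` header line and feature names of the form `family|stratum`. The UniRef90 IDs, the abundance values and the `select_features` helper are hypothetical and only serve as an example.

```python
import csv
import io


def select_features(table_text, ids_of_interest):
    """Collect totals and per-stratum values for selected feature IDs.

    Assumes a tab-separated table whose first column is either
    'FEATURE' (community total) or 'FEATURE|STRATUM' (stratified row).
    """
    selected = {}
    for row in csv.reader(io.StringIO(table_text), delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and the header line
        feature, _, stratum = row[0].partition("|")
        if feature not in ids_of_interest:
            continue
        entry = selected.setdefault(feature, {"total": 0.0, "by_stratum": {}})
        value = float(row[1])
        if stratum:
            entry["by_stratum"][stratum] = value
        else:
            entry["total"] = value
    return selected


# Hypothetical example table; IDs and values are invented.
example_table = (
    "# Gene Family\tsample_Abundance-RPKs\n"
    "UniRef90_HYPO1\t30.0\n"
    "UniRef90_HYPO1|g__Escherichia.s__Escherichia_coli\t20.0\n"
    "UniRef90_HYPO1|unclassified\t10.0\n"
    "UniRef90_HYPO2\t500.0\n"
)

hits = select_features(example_table, {"UniRef90_HYPO1"})
print(hits)
```

Any pattern found this way is still subject to the specificity limits described above, so it would be a candidate for confirmation with ShortBRED rather than a final result.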