Building own reduced database for AMPs, Virulence factors and others

plicht · July 8, 2021, 3:29pm

First of all, thanks for this amazing tool! The species stratifications are great.

What I am wondering about is whether it is possible to build a custom database to screen against, as one can do with ShortBRED. In particular, I am interested in antimicrobial peptides (AMPs), secondary metabolites, virulence factors and antibiotic resistance genes. These should also be included in UniRef90, but when using such comprehensive databases I got the feeling that the results I am interested in are noisy because of far more abundant genes, such as those for peptidoglycan synthesis.

Best

Philipp

franzosa · July 9, 2021, 7:21pm

The concern is more so that, when you’re doing really broad functional profiling, you run the risk of losing some specificity (i.e. saying proteins of interest are present when they are not). HUMAnN does a lot of internal QC to maximize specificity, but there are some limits to what it can do while also focusing on the entire protein universe. ShortBRED is only focusing on a small number of proteins at a time, so it can do a lot more upstream work to try to identify the best subsequences to look for to identify and quantify those proteins.

Notably, using a more restricted database in HUMAnN would not necessarily solve this problem, as you could still have off-target hits from your sample map to the smaller database and give the appearance that those proteins were present, when in fact they weren’t. For example, imagine you had a protein X with domains A and B and you made a database of just that protein. If your sample contains other proteins with homologous A and B domain sequences, the X protein might recruit their reads and appear to be present even if it wasn’t.
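The protein X scenario can be sketched as a toy simulation. This is only an illustration under loud assumptions: exact substring matching stands in for translated alignment, the sequences are random, and every name and length here (`protein_x`, `domain_a`, a 10-residue junction marker, 20-residue reads) is made up. It is not how HUMAnN or ShortBRED actually map reads.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def make_seq(length, rng):
    """Random peptide sequence (hypothetical stand-in for a real domain)."""
    return "".join(rng.choice(AMINO_ACIDS) for _ in range(length))


def reads_from(seq, read_len):
    """All overlapping fragments of a sequence, mimicking translated reads."""
    return [seq[i:i + read_len] for i in range(len(seq) - read_len + 1)]


def covered_fraction(target, reads):
    """Fraction of target positions covered by reads that match it exactly."""
    covered = [False] * len(target)
    for read in reads:
        start = target.find(read)
        while start != -1:
            for i in range(start, start + len(read)):
                covered[i] = True
            start = target.find(read, start + 1)
    return sum(covered) / len(target)


rng = random.Random(0)
domain_a = make_seq(60, rng)
domain_b = make_seq(60, rng)

# The reduced database contains only protein X (domains A + B).
protein_x = domain_a + domain_b

# The sample does NOT contain X, only two other proteins sharing its domains.
protein_y = make_seq(40, rng) + domain_a
protein_z = domain_b + make_seq(40, rng)
sample_reads = reads_from(protein_y, 20) + reads_from(protein_z, 20)

# Whole-protein view: X recruits reads from Y and Z and looks fully covered.
coverage = covered_fraction(protein_x, sample_reads)

# Marker-style view: a subsequence spanning the A/B junction is specific to X.
junction_marker = protein_x[55:65]
junction_hits = sum(junction_marker in read for read in sample_reads)

print(f"apparent coverage of X: {coverage:.2f}")
print(f"reads containing the X-specific junction: {junction_hits}")
```

In this toy setup X looks covered along its whole length even though it is absent, while no read contains the junction that only X has. That is the intuition behind searching for carefully chosen subsequences rather than whole proteins.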

plicht · July 12, 2021, 11:23am

Thanks for your detailed explanation @franzosa

So you would recommend using HUMAnN only for screening against a comprehensive database, because the trade-off between specificity and power is then better than in a screen against a reduced database?

My next best guess, to keep the advantage of the stratifications output by HUMAnN, is to run a prescreen in ShortBRED and then use the proteins found to be present as a search database in HUMAnN.

franzosa · July 12, 2021, 5:47pm

The tools (HUMAnN and ShortBRED) are really geared toward separate questions/goals, and their databases and algorithms reflect those differences.

HUMAnN = “I want to assign as many reads in my sample to candidate functions as possible. Coverage is more important to me than being 100% specific.”

ShortBRED = “Do a small number of functions of interest occur in my samples, and if so, in what quantities? I really need to avoid false positives as much as possible.”

If using both tools, I would guess the progression would be more like “Find an interesting pattern for function X in the HUMAnN output” → “Confirm that pattern using ShortBRED.” That assumes you’re starting your analysis without a specific hypothesis about function X; if you are interested in function X going in then I would just start with a ShortBRED approach.
