ShortBRED identify gives no markers

plicht · July 12, 2021, 4:49pm

Hello,
I am trying to screen my metagenomic libraries for antimicrobial peptides. Therefore, I downloaded the database from APD3 (https://academic.oup.com/nar/article/44/D1/D1087/2503090) (https://wangapd3.com/APD_sequence_release_09142020.fasta). After filtering for peptides of bacterial origin only, I tried to build markers using shortbred_identify.py against UniRef90.

However, I was not able to get any markers. After grouping my input proteins with CD-HIT, no sequences have been sorted. The input proteins are quite small (min: 2, 25% quartile: 19, median: 30, 75% quartile: 44, max: 100 amino acids).
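A length summary like the one above could be reproduced with a short script. This is only a sketch: the function names `read_fasta_lengths`, `summarize`, and `shorter_than` are hypothetical helpers (not part of ShortBRED or CD-HIT), and the `cutoff` value is a placeholder, not a documented size limit of either tool.

```python
import statistics


def read_fasta_lengths(lines):
    """Return {header: sequence length} from an iterable of FASTA lines."""
    lengths = {}
    name = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Use the full header text as the key; fall back to a counter
            # if the header is empty.
            name = line[1:].strip() or f"unnamed_{len(lengths) + 1}"
            lengths[name] = 0
        elif name is not None:
            lengths[name] += len(line)
    return lengths


def summarize(lengths):
    """Min, quartiles, and max of the sequence lengths (needs >= 2 sequences)."""
    values = sorted(lengths.values())
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {"min": values[0], "q1": q1, "median": median,
            "q3": q3, "max": values[-1]}


def shorter_than(lengths, cutoff):
    """Headers of sequences shorter than a chosen cutoff (placeholder value)."""
    return sorted(name for name, n in lengths.items() if n < cutoff)
```

Usage would look like `summarize(read_fasta_lengths(open("peptides.fasta")))`, where `peptides.fasta` stands in for the filtered APD3 file. Listing the shortest peptides with `shorter_than` may help spot the ones most likely to be dropped by a downstream size filter.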

Do I need to adjust flags like --markerlength or --minAln (the minimum for a short, high-identity region)? I played around with these flags but only got errors.

Best,
Philipp

franzosa · July 12, 2021, 6:09pm

Hmm, those parameters are all about the sizes of marker regions within full-length proteins, which is what we designed ShortBRED to target (the idea being to screen out non-unique regions of longer proteins that might induce false positives when profiling with short reads). Your starting proteins/peptides are so small that they might be hitting a size limit early in the pipeline, e.g. during clustering inside CD-HIT.

Since your peptides are so small to begin with (on the order of a read length or smaller), you might try just directly searching your reads against the peptides with an accelerated search (with the peptides indexed as a database). As a ShortBRED-like filter, you could also map the peptides against UniRef90 and note if they occur as subsequences of longer proteins. If some do, I’d be less confident about their abundances from the initial search. (The second step would be analogous to ShortBRED saying that it could not identify a unique marker sequence for a protein.)
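The subsequence check could be sketched roughly as follows. This is an illustration under stated assumptions, not ShortBRED's own logic: it assumes the peptide-vs-UniRef90 search results are available as tab-separated rows with the columns query ID, subject ID, percent identity, alignment length, query length, and subject length (a layout that would have to be requested from whatever search tool is used). The function name `embedded_peptides` and all three thresholds are hypothetical placeholders to tune.

```python
def embedded_peptides(rows, min_identity=90.0, min_query_cover=0.9,
                      min_length_ratio=1.5):
    """Flag peptides that look like subsequences of longer proteins.

    rows: iterable of 6-field records (assumed column order):
          qseqid, sseqid, pident, length, qlen, slen
    Returns {peptide_id: [ids of longer proteins that contain it]}.
    All thresholds are illustrative defaults, not ShortBRED settings.
    """
    flagged = {}
    for qseqid, sseqid, pident, length, qlen, slen in rows:
        pident = float(pident)
        length, qlen, slen = int(length), int(qlen), int(slen)
        # Alignment length can include gaps, so this is an approximate
        # measure of how much of the peptide is covered.
        covers_peptide = length / qlen >= min_query_cover
        # Only hits to clearly longer proteins count as "embedded".
        subject_is_longer = slen >= min_length_ratio * qlen
        if pident >= min_identity and covers_peptide and subject_is_longer:
            flagged.setdefault(qseqid, []).append(sseqid)
    return flagged
```

Rows could be fed in with `csv.reader(handle, delimiter="\t")`. Peptides that end up in the returned dictionary are the ones whose read-level abundances deserve less confidence; hits below `min_identity` are ignored by construction.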

plicht · November 30, 2021, 1:49pm

Hello @franzosa

First of all, thanks a lot for the nice explanation.
I want to get back to that problem now, and I am thinking about using my proteins of interest as direct input to Shortbred-Quantify (instead of the marker file generated by Shortbred-Identify).

Do you think this approach is reasonable? If yes, should I adjust flags like --ID or --pctlength?

From the Shortbred publication, I guess this would need to be done, because the Quantify tier allows mapping upstream and downstream of the previously identified markers, whereas I would input whole protein sequences instead of markers.

I just used UniProt BLAST to check some proteins for hits in UniRef90, as you proposed. I observed strong identity with the protein of interest and quite a lot of spurious hits as subsequences in longer proteins (between 40-60% identity). Because Shortbred only reports hits above a minimum identity, do you think these hits can be neglected?

Best
Philipp

plicht · January 4, 2022, 2:06pm

I wonder if you had the time to go through this question @franzosa
