ShortBRED identify gives no markers

Hello,
I am trying to screen my metagenomic libraries for antimicrobial peptides. Therefore, I downloaded the database from APD3 (https://academic.oup.com/nar/article/44/D1/D1087/2503090) (https://wangapd3.com/APD_sequence_release_09142020.fasta). After filtering to keep only peptides of bacterial origin, I tried to build markers using shortbred_identify.py against UniRef90.

However, I was not able to get any markers. After grouping my input proteins with CD-HIT, no sequences were sorted. The input proteins are quite small (min: 2, 25% quartile: 19, median: 30, 75% quartile: 44, max: 100 amino acids).

Do I need to adjust flags like --markerlength or --minAln (the minimum for a short, high-identity region)? I played around with these flags but only got errors.

Best,
Philipp

Hmm, those parameters are all about the sizes of marker regions within full-length proteins, which is what we designed ShortBRED to target (the idea being to screen out non-unique regions of longer proteins that might induce false positives when profiling with short reads). Your starting proteins/peptides are so small that they might be hitting a size limit early in the pipeline, e.g. clustering inside CD-HIT.
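One way to test the size-limit idea is to count how many input peptides fall below a given length before running anything else. The sketch below is only illustrative: the sequences are made-up toy strings, `read_fasta` and `length_report` are hypothetical helper names, and the cutoff of 10 amino acids is an assumed value (CD-HIT's `-l` option, the length of throw-away sequences, is one place such a cutoff is applied, so check what your version and ShortBRED's call actually use).

```python
from statistics import median


def read_fasta(text):
    """Parse FASTA-formatted text into a dict of {sequence_id: sequence}."""
    records = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Use the first whitespace-delimited token as the sequence ID.
            header = line[1:].split()[0]
            records[header] = ""
        elif header is not None:
            records[header] += line
    return records


def length_report(records, min_length=10):
    """Summarize sequence lengths and list IDs shorter than min_length.

    min_length=10 is an assumed cutoff, not a confirmed ShortBRED setting.
    """
    if not records:
        raise ValueError("no sequences found")
    lengths = [len(seq) for seq in records.values()]
    too_short = sorted(
        name for name, seq in records.items() if len(seq) < min_length
    )
    return {
        "n": len(lengths),
        "min": min(lengths),
        "median": median(lengths),
        "max": max(lengths),
        "too_short": too_short,
    }


# Toy input; replace with the contents of your filtered peptide FASTA file.
example = """>pepA
ACDEFGHIKLMNPQRS
>pepB
KW
>pepC
ACDEFGHIKLMNPQRSTVWYACD
"""

report = length_report(read_fasta(example))
print(report)
```

If a large share of the peptides lands in `too_short`, that would be consistent with sequences being discarded before any marker can be built.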

Since your peptides are so small to begin with (on the order of a read-length or smaller), you might try just directly searching your reads against the peptides with an accelerated search (with the peptides indexed as a database). As a ShortBRED-like filter, you could also map the peptides against UniRef90 and note if they occur as subsequences of longer proteins. If some do, I’d be less confident about their abundances from the initial search. (The second step would be analogous to ShortBRED saying that it could not identify a unique marker sequence for a protein.)

Hello @franzosa

First of all, thanks a lot for the nice explanation.
I want to get back to that problem now, and I am thinking about using my proteins of interest as direct input to ShortBRED-Quantify (instead of the marker file generated by ShortBRED-Identify).

Do you think this approach is reasonable? If yes, should I adjust flags like --ID or --pctlength?

From the ShortBRED publication, I guess this would need to be done, because the Quantify step allows mapping upstream and downstream of the previously identified markers, and I would be inputting whole protein sequences rather than markers.

I just used UniProt BLAST to check some proteins for hits in UniRef90, as you proposed. I observed strong identity with the protein of interest, plus quite a lot of spurious hits as subsequences of longer proteins (40-60% identity). Because ShortBRED only reports hits above a minimum identity, do you think these hits can be neglected?

Best
Philipp

I wonder if you have had time to look at this question, @franzosa.