I am trying to build markers for antibiotic resistance genes using the updated CARD database. I see that there are several UniRef databases (UniRef50, UniRef90, and UniRef100), and I wonder what the consequences are of using one over another. The official ShortBRED page mentions UniRef90. Is this the background database you recommend for ShortBRED? If so, what is your reasoning for choosing it over UniRef100 or UniRef50? Would UniRef100 also be a possible background reference database for ShortBRED-Identify?
We chose UniRef90 as the default database for ShortBRED because it performs better than UniRef50 or UniRef100. UniRef90 is built by clustering UniRef100 sequences so that each cluster is composed of sequences that have at least 90% sequence identity to, and 80% overlap with, the longest sequence in the cluster (the seed sequence).
To build on this answer a bit: part of why UniRef90 performs better is that you’re mapping against a smaller sequence set (compared to UniRef100) without losing much resolution, whereas UniRef50 (clustered at 50% amino acid identity to be smaller still) starts to lose some of the local homology we’re interested in finding with ShortBRED.
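To make the size-versus-resolution trade-off concrete, here is a toy Python sketch. It is only an illustration under simplifying assumptions, not how UniRef is actually built: the sequences are made up, identity is computed position by position on equal-length strings (no alignment), the 80% overlap criterion is ignored, and the helper names `identity` and `greedy_cluster` are hypothetical.

```python
def identity(a: str, b: str) -> float:
    """Fraction of matching positions between two equal-length toy sequences."""
    if len(a) != len(b):
        raise ValueError("this toy example only compares equal-length sequences")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def greedy_cluster(seqs, threshold):
    """Keep a sequence as a new representative only if it does not match
    any existing representative at or above the identity threshold."""
    reps = []
    for s in seqs:
        if not any(identity(s, r) >= threshold for r in reps):
            reps.append(s)
    return reps


# Made-up 10-residue sequences, for illustration only.
toy = [
    "MKTAYIAKQR",  # first sequence seen
    "MKTAYIAKQR",  # exact duplicate of the first
    "MKTAYIAKQK",  # 90% identical to the first
    "MKTAYLLLLL",  # 50% identical to the first
    "GGGGGGGGGG",  # unrelated
]

counts = {t: len(greedy_cluster(toy, t)) for t in (1.0, 0.9, 0.5)}
for t, n in counts.items():
    print(f"identity threshold {t:.0%}: {n} representative sequences")
```

In this toy set, going from 100% to 90% drops only a near-identical variant, while going to 50% also merges a sequence that shares just half its residues with its representative, which is the kind of detail a marker search may need to keep.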