ShortBRED key error 'unassigned'

I’m trying to run shortbred.identify on a collection of virulence factors. The program runs for 4-5 days before throwing the following error.

Processing Victors|15900819 …
Processing PATRIC_VF|STM14_0369 …
Processing Victors|16765465 …
Processing VFDB|VFG043504 …
Processing Victors|227082474 …
Processing unassigned …
Checking dependencies…
Checking to make sure that installed version of usearch can make databases…
Traceback (most recent call last):
File “/opt/htcf/spack/opt/spack/linux-ubuntu16.04-x86_64/gcc-5.4.0/shortbred-0.9.4-dah7urs5k3m6grmz63izkcikwocddmmu/bin/shortbred_identify.py”, line 417, in
atupQuasiMarkers1 = pb.FindJMMarker(setLeftover, dictGOIGenes, dictGOIHits,dictRefHits,iShortRegion = int(math.floor(args.iQMlength*.40)),iXlimit=int(args.iXlimit),iMarkerLen=args.iQMlength)
File “/opt/htcf/spack/opt/spack/linux-ubuntu16.04-x86_64/gcc-5.4.0/shortbred-0.9.4-dah7urs5k3m6grmz63izkcikwocddmmu/src/process_blast.py”, line 126, in FindJMMarker
atupHitInfo = dictGOIHits[strGene] + dictRefHits.get(strGene,[])
KeyError: ‘unassigned’
srun: error: n198: task 0: Exited with exit code 1

Any advice would be most appreciated. Thank you!

Hey @bccrhp

I got the same error and was able to solve it. Your FASTA headers need to hold a longer description of the amino acid sequences, i.e. X1234 is not sufficient. It needs to be something like X1234_GAPDH.

This seems to be a blastp-specific error that causes trouble later in the ShortBRED pipeline.

EDIT: Check the proteins present in the unassigned.faa file. Those are all proteins with an amino acid length of 10 at most. Delete all proteins with a sequence length of 10 or less from your input protein FASTA file (--goi).
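In case it helps, here is a minimal sketch of how the short sequences could be filtered out before running ShortBRED. It is plain Python with no dependencies. The function names, the file names in the usage note and the choice to ignore a trailing "*" when measuring length are my own assumptions, not anything ShortBRED provides.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def filter_short(records, min_len=11):
    """Keep records with at least min_len residues (default drops length <= 10).

    A trailing '*' (stop symbol) is not counted towards the length;
    that is an assumption about how the input was exported.
    """
    return [(h, s) for h, s in records if len(s.rstrip("*")) >= min_len]


def write_fasta(records, handle, width=60):
    """Write records to an open text handle, wrapping sequences at `width`."""
    for header, seq in records:
        handle.write(f">{header}\n")
        for i in range(0, len(seq), width):
            handle.write(seq[i:i + width] + "\n")


def filter_fasta_file(in_path, out_path, min_len=11):
    """Read in_path, drop short sequences, write the rest to out_path."""
    with open(in_path) as fin:
        kept = filter_short(read_fasta(fin), min_len)
    with open(out_path, "w") as fout:
        write_fasta(kept, fout)
    return len(kept)
```

Usage would be something like filter_fasta_file("virulence_factors.faa", "virulence_factors.filtered.faa") (both file names are placeholders), and then pass the filtered file to --goi.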