I have already run BLAST of my proteins against all of UniRef90 (blast/2.2.29), which took 48 days. So if you can help solve the problem using this output, I would be very grateful.
Thank you for replying to me! Sure, I have copy-pasted below the error that I got when I ran the following command: “shortbred_identify.py --cdhit /hpcfs/apps/cd-hit/4.6.1/cd-hit --usearch {main}/usearch11.0.667_i86linux32 --goiclust {main}/attempt2/clust/clust.renamed.faa --goiblast {main}/attempt2/blastresults/selfblast.txt --refblast {main}/attempt2/blastresults/refblast.txt --map_in ${main}/attempt2/clust/clust.map --markers TAII.markers.faa --threads 12”
As I understand it, based on the thread that I linked, the problem is caused by differences between the FASTA headers and the BLAST query IDs. In fact, I found that the clust.faa headers differ from the IDs of the BLAST queries. The difference was easy to fix, since BLAST takes as the protein ID only the characters before the first space of the clust.faa header, so I changed the headers of clust.faa accordingly, but the problem persists. I also looked for the protein ‘WP_080765506.1’ reported in the error: it is present in clust.faa, clust.map, and selfblast.txt, but not in refblast.txt. I don’t know whether that could be related to the problem.
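The header change described above (keeping only the characters before the first space) could be sketched roughly as follows. This is only an illustration, not part of ShortBRED: the function names and the description text in the example header are made up, and it assumes a plain-text FASTA file read line by line.

```python
def first_token(header: str) -> str:
    """Return the header text before the first whitespace, without the '>'."""
    parts = header.lstrip(">").split()
    return parts[0] if parts else ""


def truncate_headers(lines):
    """Yield FASTA lines with every header reduced to its first token.

    Sequence lines are passed through unchanged (minus the line ending).
    """
    for line in lines:
        line = line.rstrip("\r\n")
        if line.startswith(">"):
            yield ">" + first_token(line)
        else:
            yield line


# Hypothetical record: the ID is the one from the error message,
# the description and the sequence are invented for illustration.
example = [">WP_080765506.1 hypothetical protein [Some organism]\n", "MKT\n"]
print(list(truncate_headers(example)))
```

To apply it to a real file, the lines of the input FASTA would be passed to `truncate_headers` and the result written to a new file, so the original stays untouched.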
Finding overlap with reference database…
Finding overlap with family consensus database…
Checking dependencies…
Checking to make sure that installed version of usearch can make databases…
Traceback (most recent call last):
  File "/hpcfs/home/ciencias/biologia/postgrado/l.avellaneda50/.conda/envs/concoct_env/bin/shortbred_identify.py", line 364, in <module>
    dictGOICounts = pb.MarkX(dictGOIGenes,dictGOICounts)
  File "/hpcfs/home/ciencias/biologia/postgrado/l.avellaneda50/.conda/envs/concoct_env/bin/src/process_blast.py", line 491, in MarkX
    dictOverlap[strName][i] = dictOverlap[strName][i] + 9999999
KeyError: 'WP_080765506.1'
DONE
Sorry for the long delay. My best guess is still that this results from something in the sequence renaming (e.g. the sequence’s name in the FASTA of genes of interest is not the same as the version in the BLAST output, perhaps due to an extra space at the end, removal of the .1, etc.).
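One way to test the renaming hypothesis is to compare the IDs in the FASTA file with the query IDs in the BLAST output. The sketch below is not part of ShortBRED; it assumes the BLAST results are tab-separated with the query ID in the first column, and the function names and the second ID in the example are hypothetical.

```python
def fasta_headers(lines):
    """Return the full header text after '>', keeping trailing spaces."""
    return [line[1:].rstrip("\r\n") for line in lines if line.startswith(">")]


def blast_query_ids(lines):
    """Return the set of first-column values (assumed to be query IDs)."""
    return {
        line.rstrip("\r\n").split("\t")[0]
        for line in lines
        if line.strip() and not line.startswith("#")
    }


def compare_ids(fasta_lines, blast_lines):
    """Report IDs found in only one file, plus headers containing whitespace."""
    headers = fasta_headers(fasta_lines)
    queries = blast_query_ids(blast_lines)
    return {
        "fasta_only": sorted(set(headers) - queries),
        "blast_only": sorted(queries - set(headers)),
        "has_whitespace": sorted(h for h in headers if h != h.strip() or " " in h),
    }


# Invented data: the first header carries a trailing space.
fasta = [">WP_080765506.1 \n", "MKT\n", ">WP_000000001.1\n", "MAA\n"]
blast = [
    "WP_080765506.1\tWP_080765506.1\t100.0\n",
    "WP_000000001.1\tWP_000000001.1\t100.0\n",
]
for key, values in compare_ids(fasta, blast).items():
    # repr() makes trailing spaces visible in the printed IDs
    print(key, [repr(v) for v in values])
```

An ID listed under `fasta_only` but absent from a BLAST file may simply have had no hits, so a mismatch points to where to look rather than proving the cause.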
If that’s not the case, I found a similar error from a long time ago that resulted from a duplicated sequence ID, though I believe that issue was corrected in the software. Still, it is worth double-checking that the renaming process didn’t introduce duplicated sequence IDs in your files.
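A quick check for duplicated IDs could look like the sketch below. Again, this is an illustration rather than anything ShortBRED provides: it assumes the sequence ID is the first whitespace-delimited token of each FASTA header, and the function name and example records are made up.

```python
from collections import Counter


def duplicated_ids(fasta_lines):
    """Return {sequence_id: count} for IDs that appear more than once."""
    ids = [
        line[1:].split()[0]
        for line in fasta_lines
        if line.startswith(">") and line[1:].strip()
    ]
    return {seq_id: n for seq_id, n in Counter(ids).items() if n > 1}


# Invented records: the same ID appears twice, once with a description.
records = [
    ">WP_080765506.1 some description\n", "MKT\n",
    ">WP_000000001.1\n", "MAA\n",
    ">WP_080765506.1\n", "MKT\n",
]
print(duplicated_ids(records))
```

An empty result means no ID is repeated in that file; the same check can be run on each FASTA file passed to shortbred_identify.py.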