The bioBakery help forum

Creating markers after blasting

Hi there!

I am trying to create some markers, and I am having this problem https://groups.google.com/forum/#!topic/shortbred-users/paafRPifkM8

I have already run BLAST (blast/2.2.29) of my proteins against all of UniRef90, which took 48 days. So if you can help me solve the problem using this output, I will be very grateful.

Thank you,
Laura.

Hi Laura - Looks like there’s a bunch of things going on in the thread you linked. Could you clarify what specific problem you’re having? Thanks.

Hi Franzosa!

Thank you for replying to me! Sure, here is the error that I got when I ran the following command: “shortbred_identify.py --cdhit /hpcfs/apps/cd-hit/4.6.1/cd-hit --usearch {main}/usearch11.0.667_i86linux32 --goiclust {main}/attempt2/clust/clust.renamed.faa --goiblast {main}/attempt2/blastresults/selfblast.txt --refblast {main}/attempt2/blastresults/refblast.txt --map_in ${main}/attempt2/clust/clust.map --markers TAII.markers.faa --threads 12”

As I understand it, based on the thread that I linked, the problem is caused by differences between the headers of the FASTA files and the BLAST query IDs. In fact, I found that the clust.faa headers differ from the IDs of the BLAST queries. The difference was easy to fix, since BLAST takes as the protein ID only the characters before the first space of the clust.faa header, so I changed the headers of clust.faa, but the problem persists. I also looked for the protein ‘WP_080765506.1’, which is reported in the error. This protein is present in all the files (clust.faa, clust.map, and self.txt) but not in ref.txt; I don’t know if that could be associated with the problem.

Finding overlap with reference database…
Finding overlap with family consensus database…
Checking dependencies…
Checking to make sure that installed version of usearch can make databases…
Traceback (most recent call last):
  File "/hpcfs/home/ciencias/biologia/postgrado/l.avellaneda50/.conda/envs/concoct_env/bin/shortbred_identify.py", line 364, in <module>
    dictGOICounts = pb.MarkX(dictGOIGenes,dictGOICounts)
  File "/hpcfs/home/ciencias/biologia/postgrado/l.avellaneda50/.conda/envs/concoct_env/bin/src/process_blast.py", line 491, in MarkX
    dictOverlap[strName][i] = dictOverlap[strName][i] + 9999999
KeyError: 'WP_080765506.1'
DONE

Thank you,
Laura.

Hi there!

I haven’t been able to fix this problem yet, and I really want to use ShortBRED with this set of proteins. If someone can help me, I will be grateful.

Thank you,
Laura

Hi! I wonder if you checked my new post. Thank you.

Sorry for the long delay. My best guess is still that this results from something in the sequence renaming (e.g. the sequence’s name in the FASTA of genes of interest is not the same as the version in the BLAST output, perhaps due to an extra space at the end, removal of the .1, etc.).
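One way to check for this kind of mismatch is to compare the FASTA headers against the query IDs in the BLAST output directly. The snippet below is only a rough sketch, not part of ShortBRED: it assumes the BLAST results are tab-separated with the query ID in the first column, and the function names (`fasta_headers`, `blast_query_ids`, `compare_ids`) are made up for illustration.

```python
def fasta_headers(lines):
    """Return the full header text (without '>') of each FASTA record."""
    return [line[1:].rstrip("\r\n") for line in lines if line.startswith(">")]


def blast_query_ids(lines):
    """Return the set of query IDs, assuming tab-separated BLAST output
    with the query ID in the first column (an assumption, see above)."""
    return {
        line.rstrip("\r\n").split("\t")[0]
        for line in lines
        if line.strip() and not line.startswith("#")
    }


def compare_ids(fasta_lines, blast_lines):
    """Report headers containing whitespace and IDs found on only one side."""
    headers = fasta_headers(fasta_lines)
    queries = blast_query_ids(blast_lines)
    return {
        # Headers with spaces or tabs anywhere, including trailing ones.
        "headers_with_whitespace": [h for h in headers if any(c.isspace() for c in h)],
        # Exact-match comparison, so a dropped '.1' or extra space shows up here.
        "in_fasta_only": sorted(set(headers) - queries),
        "in_blast_only": sorted(queries - set(headers)),
    }


# Example usage with the file names from the command above (paths are placeholders):
# with open("clust.renamed.faa") as fa, open("selfblast.txt") as bl:
#     print(compare_ids(fa.readlines(), bl.readlines()))
```

An ID that shows up only on the FASTA side of the reference BLAST file is not necessarily an error, since a sequence may simply have no hits there. For the self-BLAST file, an ID missing on either side seems more likely to point to a renaming problem.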

If that’s not the case, I found a similar error from a long time ago that resulted from a duplicated sequence ID, though I believe that issue was corrected in the software. Still, worth a double check that the renaming process didn’t induce duplicated sequence IDs in your files.
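For the duplicate check, something like the following could work. Again this is just a sketch with a hypothetical helper name (`duplicated_ids`); it treats the ID as the text before the first whitespace in each header, which matches how Laura described BLAST handling the headers.

```python
from collections import Counter


def duplicated_ids(fasta_lines):
    """Return the sorted IDs that appear in more than one FASTA header.

    The ID is taken as the text before the first whitespace in the header.
    """
    ids = [
        line[1:].split()[0]
        for line in fasta_lines
        if line.startswith(">") and line[1:].strip()
    ]
    return sorted(seq_id for seq_id, count in Counter(ids).items() if count > 1)


# Example usage (path is a placeholder):
# with open("clust.renamed.faa") as fa:
#     print(duplicated_ids(fa.readlines()))
```

An empty list means no ID is repeated. Any ID that is listed would be worth tracing back through the renaming step.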