Non-printing character 0x00 in sequence FASTA file ERROR

ekroeber · April 19, 2021, 8:41am

Hi,
I ran ShortBRED with contig files and the results were not as expected, so I wanted to run it again with the error-corrected, joined-sequence FASTA files. However, I always get the error
Non-printing character 0x00 in sequence FASTA file
when I check the log file:

Tested usearch. Appears to be working.
    Tested blastp. Appears to be working.
    Tested muscle returned a nonzero exit code (typically indicates failure). Please check to ensure the program is working. Will continue running.
    Path for cdhit appears to be fine. This program returns an error [exit code=1] when tested and working properly, so ShortBRED does not check it.
    Tested makeblastdb. Appears to be working.
    Usearch appears to be working.

    Clustering proteins of interest...
    ================================================================
    Program: CD-HIT, V4.7 (+OpenMP), Feb 01 2021, 15:06:42
    Command: /opt/share/software/packages/cdhit-4.6.8/bin/cd-hit
             -i ./SeqForSpartina.fasta -o
             tmp69981618818831066/clust/clust.faa -d 0 -c 0.85 -b
             10 -g 1

    Started: Mon Apr 19 09:53:51 2021
    ================================================================
                                Output                              
    ----------------------------------------------------------------
    total seq: 25
    longest and shortest : 954 and 109
    Total letters: 10886
    Sequences have been sorted

    Approximated minimal memory consumption:
    Sequence        : 0M
    Buffer          : 1 X 10M = 10M
    Table           : 1 X 65M = 65M
    Miscellaneous   : 0M
    Total           : 75M

    Table limit with the given memory limit:
    Max number of representatives: 1279296
    Max number of word counting entries: 90502859


    comparing sequences from          0  to         25

           25  finished         25  clusters

    Apprixmated maximum memory consumption: 76M
    writing new database
    writing clustering information
    program completed !

    Total CPU time 0.11
    Protein sequences clustered.Creating folders for each protein family...
    Making a fasta file for each protein family...
    Aligning sequences in each family, producing consensus sequences...
    Making BLAST database for the family consensus sequences...
    Making BLAST database for the reference protein sequences...
    BLASTing the consensus family sequences against themselves...
    Warning: [blastp] Number of threads was reduced to 64 to match the number of available CPUs
    BLASTing the consensus family sequences against the reference protein sequences...
    Warning: [blastp] Number of threads was reduced to 64 to match the number of available CPUs
    Finding overlap with reference database...
    Finding overlap with family consensus database...
    Found True Markers...
    No Quasi Markers needed...

    Tmp markers saved to tmp69981618818831066/framecheck/FirstMarkers.faa

    Processing complete! Final markers saved to ./markersforSpartina.fasta
    Checking dependencies...
    Checking to make sure that installed version of usearch can make databases...
    Tested usearch. Appears to be working.
    Treating input as a wgs file...
    usearch v7.0.1090_i86linux32, 4.0Gb RAM (528Gb total), 64 cores
    (C) Copyright 2013 Robert C. Edgar, all rights reserved.
    http://drive5.com/usearch

    Licensed to: hgruber@mpi-bremen.de

    00:00  19Mb Reading input
    00:00  22Mb    1.0% Masking
    00:00  22Mb  100.0% Masking
    00:00  35Mb    1.0% Word stats
    00:00  35Mb  100.0% Word stats
    00:00  73Mb    0.0% Building slots
    00:01  73Mb   24.7% Building slots
    00:02  73Mb   78.1% Building slots
    00:02  73Mb  100.0% Building slots
    00:02  60Mb    1.0% Build index   
    00:02  64Mb  100.0% Build index
    00:02  64Mb    0.0% Rows       
    00:02  64Mb  100.0% Rows
    00:02  64Mb Buffers     
    00:02  80Mb    1.0% Seqs
    00:02  80Mb  100.0% Seqs
    00:02  64Mb 100.0% completed, split 1 (97 seqs)
    00:02  64Mb Total 1 splits, 97 seqs

    List of files in WGS set:./joinedMSAP1.fasta

    List of files in WGS set (after unpacking tarfiles):./joinedMSAP1.fasta 

    Working on file 1 of 1
    usearch v7.0.1090_i86linux32, 4.0Gb RAM (528Gb total), 64 cores
    (C) Copyright 2013 Robert C. Edgar, all rights reserved.
    http://drive5.com/usearch

    Licensed to: hgruber@mpi-bremen.de

    00:00  19Mb Reading /scratch/ekroeber/tmp.942458/joinedMSAP1_shortbred_tmp/markersforSpartina.fasta.udb
    00:00  57Mb Database loaded
    00:00  58Mb    0.1% Searching, 0.0% matched
    00:01  62Mb    0.5% Searching, 0.4% matched
    00:02  62Mb    1.2% Searching, 0.4% matched
    00:03  62Mb    2.0% Searching, 0.4% matched
    00:04  62Mb    2.9% Searching, 0.4% matched
    00:05  62Mb    3.7% Searching, 0.4% matched
    00:06  62Mb    4.5% Searching, 0.4% matched
    00:07  62Mb    5.3% Searching, 0.4% matched
    00:08  62Mb    6.1% Searching, 0.4% matched

    fastaseqsource.cpp(242): 

    /opt/extern/bremen/symbiosis/phyloFlash_old/tools/usearch7 --usearch_local /scratch/ekroeber/tmp.942458/joinedMSAP1_shortbred_tmp/fasta.fna --db /scratch/ekroeber/tmp.942458/joinedMSAP1_shortbred_tmp/markersforSpartina.fasta.udb --id 0.95 --userout /scratch/ekroeber/tmp.942458/joinedMSAP1_shortbred_tmp/wgs_01out_01.out --userfields query+target+id+alnlen+mism+opens+qlo+qhi+tlo+thi+evalue+bits+ql+tl+qs+ts --maxaccepts 1 --maxrejects 32 --threads 1

    ---Fatal error---
    Non-printing character 0x00 in sequence FASTA file '/scratch/ekroeber/tmp.942458/joinedMSAP1_shortbred_tmp/fasta.fna' line 1610696
    Traceback (most recent call last):
      File "/opt/share/software/packages/shortbred-0.9.4/shortbred_quantify.py", line 558, in <module>
        iThreads=args.iThreads,dID=args.dID,iAccepts=args.iMaxHits, iRejects=args.iMaxRejects,strUSEARCH=args.strUSEARCH )
      File "/opt/share/software/packages/shortbred-0.9.4/src/quantify_functions.py", line 232, in RunUSEARCH
        "--maxrejects",str(iRejects),"--threads", str(iThreads)])
      File "/usr/lib/python2.7/subprocess.py", line 186, in check_call
        raise CalledProcessError(retcode, cmd)
    subprocess.CalledProcessError: Command '['/opt/extern/bremen/symbiosis/phyloFlash_old/tools/usearch7', '--usearch_local', '/scratch/ekroeber/tmp.942458/joinedMSAP1_shortbred_tmp/fasta.fna', '--db', '/scratch/ekroeber/tmp.942458/joinedMSAP1_shortbred_tmp/markersforSpartina.fasta.udb', '--id', '0.95', '--userout', '/scratch/ekroeber/tmp.942458/joinedMSAP1_shortbred_tmp/wgs_01out_01.out', '--userfields', 'query+target+id+alnlen+mism+opens+qlo+qhi+tlo+thi+evalue+bits+ql+tl+qs+ts', '--maxaccepts', '1', '--maxrejects', '32', '--threads', '1']' returned non-zero exit status 1
    DONE!

What am I doing wrong? How can I solve this problem? I need the results ASAP, since I want to include them in the revisions for a manuscript which is due for re-submission very soon.
Can someone help me, please?

franzosa · April 19, 2021, 3:58pm

It sounds like the FASTA file might have a formatting problem (as can result from e.g. an interrupted download). You might be able to see the error if you scroll through the file. Otherwise you can try one of these methods to remove such characters from the file:

However, if the file is ill-formatted beyond the inclusion of these characters then this won’t be a complete solution.
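As one possible way to remove such characters, here is a minimal Python sketch that copies a file while dropping every 0x00 byte. The function name `strip_nul_bytes` and its parameters are hypothetical and are not part of ShortBRED or usearch; it only addresses NUL bytes, not any other formatting problem.

```python
def strip_nul_bytes(src, dst, chunk_size=1 << 20):
    """Copy src to dst without 0x00 bytes; return how many bytes were dropped.

    Hypothetical helper (not part of ShortBRED). The file is read in binary
    mode and in chunks, so large FASTA files are not loaded into memory.
    """
    removed = 0
    with open(src, "rb") as fin, open(dst, "wb") as fout:
        while True:
            chunk = fin.read(chunk_size)
            if not chunk:
                break
            cleaned = chunk.replace(b"\x00", b"")
            removed += len(chunk) - len(cleaned)
            fout.write(cleaned)
    return removed
```

A return value of 0 means the input contained no NUL bytes, in which case the problem is probably introduced later, for example in a derived temporary file.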

ekroeber · April 19, 2021, 6:23pm

Thank you for your answer.

That was also what I thought. However, the file is a tmp file generated by ShortBRED itself from my input file. I looked in that file for this ‘filler (0x00)’, but it is actually not in the file.
Removing it from the tmp file won’t help either: when I run the ShortBRED script again, the same file will be generated. I will try to get the raw reads again and hope that it works then. Maybe you are right and they somehow became ill-formatted during download, although they look good when I scroll through.
Thank you

ekroeber · April 20, 2021, 8:10am

Hi there,

So I removed the non-printable characters as suggested, but the error persists and ShortBRED won’t finish. Any other suggestions? I really need those results…

franzosa · April 20, 2021, 2:44pm

How big is the file throwing the error?

/scratch/ekroeber/tmp.942458/joinedMSAP1_shortbred_tmp/fasta.fna

Are you able to share it so we can take a look?

ekroeber · April 21, 2021, 6:56am

Hi,

I added the file to the following Dropbox folder:

The file fasta.fna is not one of my input files. It is generated by ShortBRED itself from my input file.
You can also find a log file in the folder (shortbred_joinedreads_new_LOG.sh.o945206), the GOI file (SeqForSpartina_GOI.fasta) and the WGS file (joinedMSAP1_new_WGS.fasta).

I thought it might have something to do with my input WGS file, so I tried different things, e.g. removing 0x00 via the commands you suggested, but nothing seemed to work. ShortBRED itself works, just not with that file.

Thank you!!!

franzosa · April 22, 2021, 1:19pm

Looking at the output, there is something wrong with the sequence >M02610:50:000000000-ABTB1:1:1101:19255:8029 in the derived fasta.fna file. However, the original sequence (in your WGS.fasta file) looks fine, so I’m not sure what’s happening (I was expecting the original sequence to also have unusual characters in it). You could try removing the sequence from the WGS file and see if that helps - it’s possible though that a similar issue will then arise later in the file and that this was just the first instance.
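To check whether the problem is limited to one sequence or recurs later in the file, a scan like the following could list every affected line up front. This is a minimal sketch: `find_nul_records` is a hypothetical helper, and it assumes a plain FASTA layout in which header lines start with `>`.

```python
def find_nul_records(fasta_path):
    """Yield (line_number, header) for every line containing a 0x00 byte.

    Hypothetical helper (not part of ShortBRED). Line numbers are 1-based,
    and header is the most recent '>' line seen (None before the first one).
    """
    header = None
    with open(fasta_path, "rb") as handle:
        for line_number, line in enumerate(handle, start=1):
            if line.startswith(b">"):
                header = line[1:].strip().decode("ascii", errors="replace")
            if b"\x00" in line:
                yield line_number, header
```

Running it on both the derived fasta.fna and the original WGS file would show whether the NUL bytes are already present in the input or only appear in the temporary file.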

ekroeber · April 22, 2021, 2:11pm

Hi Eric,

Thank you for your answer. I did remove the sequence, but the error persists:

Tested usearch. Appears to be working.
Tested blastp. Appears to be working.
Tested muscle returned a nonzero exit code (typically indicates failure). Please check to ensure the program is working. Will continue running.
Path for cdhit appears to be fine. This program returns an error [exit code=1] when tested and working properly, so ShortBRED does not check it.
Tested makeblastdb. Appears to be working.
Usearch appears to be working.

Clustering proteins of interest...
================================================================
Program: CD-HIT, V4.7 (+OpenMP), Feb 01 2021, 15:06:42
Command: /opt/share/software/packages/cdhit-4.6.8/bin/cd-hit
         -i ./SeqForSpartina.fasta -o
         tmp832611619098681397/clust/clust.faa -d 0 -c 0.85 -b
         10 -g 1

Started: Thu Apr 22 15:38:01 2021
================================================================
                            Output                              
----------------------------------------------------------------
total seq: 25
longest and shortest : 954 and 109
Total letters: 10886
Sequences have been sorted

Approximated minimal memory consumption:
Sequence        : 0M
Buffer          : 1 X 10M = 10M
Table           : 1 X 65M = 65M
Miscellaneous   : 0M
Total           : 75M

Table limit with the given memory limit:
Max number of representatives: 1279296
Max number of word counting entries: 90502859


comparing sequences from          0  to         25

       25  finished         25  clusters

Apprixmated maximum memory consumption: 76M
writing new database
writing clustering information
program completed !

Total CPU time 0.10
Protein sequences clustered.Creating folders for each protein family...
Making a fasta file for each protein family...
Aligning sequences in each family, producing consensus sequences...
Making BLAST database for the family consensus sequences...
Making BLAST database for the reference protein sequences...
BLASTing the consensus family sequences against themselves...
Warning: [blastp] Number of threads was reduced to 64 to match the number of available CPUs
BLASTing the consensus family sequences against the reference protein sequences...
Warning: [blastp] Number of threads was reduced to 64 to match the number of available CPUs
Finding overlap with reference database...
Finding overlap with family consensus database...
Found True Markers...
No Quasi Markers needed...

Tmp markers saved to tmp832611619098681397/framecheck/FirstMarkers.faa

Processing complete! Final markers saved to ./markersforSpartina.fasta
Checking dependencies...
Checking to make sure that installed version of usearch can make databases...
Tested usearch. Appears to be working.
Treating input as a wgs file...
usearch v7.0.1090_i86linux32, 4.0Gb RAM (528Gb total), 64 cores
(C) Copyright 2013 Robert C. Edgar, all rights reserved.
http://drive5.com/usearch

Licensed to: hgruber@mpi-bremen.de

00:00  19Mb Reading input
00:00  22Mb    1.0% Masking
00:00  22Mb  100.0% Masking
00:00  35Mb    1.0% Word stats
00:00  35Mb  100.0% Word stats
00:00  73Mb    0.0% Building slots
00:01  73Mb    0.8% Building slots
00:02  73Mb   56.4% Building slots
00:02  73Mb  100.0% Building slots
00:02  60Mb    1.0% Build index   
00:02  64Mb  100.0% Build index
00:02  64Mb    0.0% Rows       
00:02  64Mb  100.0% Rows
00:02  64Mb Buffers     
00:02  80Mb    1.0% Seqs
00:02  80Mb  100.0% Seqs
00:02  64Mb 100.0% completed, split 1 (97 seqs)
00:02  64Mb Total 1 splits, 97 seqs

List of files in WGS set:./joinedMSAP1_new.fasta

List of files in WGS set (after unpacking tarfiles):./joinedMSAP1_new.fasta 

Working on file 1 of 1
usearch v7.0.1090_i86linux32, 4.0Gb RAM (528Gb total), 64 cores
(C) Copyright 2013 Robert C. Edgar, all rights reserved.
http://drive5.com/usearch

Licensed to: hgruber@mpi-bremen.de

00:00  19Mb Reading /scratch/ekroeber/tmp.948167/joinedMSAP1_new_shortbred_tmp/markersforSpartina.fasta.udb
00:00  57Mb Database loaded
00:00  58Mb    0.1% Searching, 0.0% matched
00:01  62Mb    0.4% Searching, 0.4% matched
00:02  62Mb    1.3% Searching, 0.4% matched
00:03  62Mb    2.2% Searching, 0.4% matched
00:04  62Mb    3.0% Searching, 0.4% matched
00:05  62Mb    3.9% Searching, 0.4% matched
00:06  62Mb    4.8% Searching, 0.4% matched
00:07  62Mb    5.6% Searching, 0.4% matched
00:08  62Mb    6.4% Searching, 0.4% matched
00:09  62Mb    7.2% Searching, 0.4% matched
00:10  62Mb    8.1% Searching, 0.4% matched
00:11  62Mb    8.9% Searching, 0.4% matched
00:12  62Mb    9.7% Searching, 0.4% matched

fastaseqsource.cpp(242): 

/opt/extern/bremen/symbiosis/phyloFlash_old/tools/usearch7 --usearch_local /scratch/ekroeber/tmp.948167/joinedMSAP1_new_shortbred_tmp/fasta.fna --db /scratch/ekroeber/tmp.948167/joinedMSAP1_new_shortbred_tmp/markersforSpartina.fasta.udb --id 0.95 --userout /scratch/ekroeber/tmp.948167/joinedMSAP1_new_shortbred_tmp/wgs_01out_01.out --userfields query+target+id+alnlen+mism+opens+qlo+qhi+tlo+thi+evalue+bits+ql+tl+qs+ts --maxaccepts 0.0 --maxrejects 0.0 --threads 1

---Fatal error---
Non-printing character 0x00 in sequence FASTA file '/scratch/ekroeber/tmp.948167/joinedMSAP1_new_shortbred_tmp/fasta.fna' line 2338022
Traceback (most recent call last):
  File "/opt/share/software/packages/shortbred-0.9.4/shortbred_quantify.py", line 558, in <module>
    iThreads=args.iThreads,dID=args.dID,iAccepts=args.iMaxHits, iRejects=args.iMaxRejects,strUSEARCH=args.strUSEARCH )
  File "/opt/share/software/packages/shortbred-0.9.4/src/quantify_functions.py", line 232, in RunUSEARCH
    "--maxrejects",str(iRejects),"--threads", str(iThreads)])
  File "/usr/lib/python2.7/subprocess.py", line 186, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['/opt/extern/bremen/symbiosis/phyloFlash_old/tools/usearch7', '--usearch_local', '/scratch/ekroeber/tmp.948167/joinedMSAP1_new_shortbred_tmp/fasta.fna', '--db', '/scratch/ekroeber/tmp.948167/joinedMSAP1_new_shortbred_tmp/markersforSpartina.fasta.udb', '--id', '0.95', '--userout', '/scratch/ekroeber/tmp.948167/joinedMSAP1_new_shortbred_tmp/wgs_01out_01.out', '--userfields', 'query+target+id+alnlen+mism+opens+qlo+qhi+tlo+thi+evalue+bits+ql+tl+qs+ts', '--maxaccepts', '0.0', '--maxrejects', '0.0', '--threads', '1']' returned non-zero exit status 1
DONE!

Any other idea how I could make it work?

Thank you
Eileen

franzosa · April 22, 2021, 7:43pm

Sadly no - it’s having the same error further down in the original file. There is something about those sequences causing them to produce weird output characters, but I’m not sure what it is - I haven’t seen this before.

ekroeber · April 27, 2021, 12:24pm

Hi Eric,

I tried different approaches to join my reads and to convert them from FASTQ to FASTA, but so far with no success.

The latest error occurs even earlier and doesn’t seem to have anything to do with my input file:

Tested usearch. Appears to be working.
Tested blastp. Appears to be working.
Tested muscle returned a nonzero exit code (typically indicates failure). Please check to ensure the program is working. Will continue running.
Path for cdhit appears to be fine. This program returns an error [exit code=1] when tested and working properly, so ShortBRED does not check it.
Tested makeblastdb. Appears to be working.
Usearch appears to be working.

Clustering proteins of interest...
================================================================
Program: CD-HIT, V4.7 (+OpenMP), Feb 01 2021, 15:06:42
Command: /opt/share/software/packages/cdhit-4.6.8/bin/cd-hit
         -i ./SeqForSpartina.fasta -o
         tmp929361619523380755/clust/clust.faa -d 0 -c 0.85 -b
         10 -g 1

Started: Tue Apr 27 13:36:20 2021
================================================================
                            Output                              
----------------------------------------------------------------
total seq: 25
longest and shortest : 954 and 109
Total letters: 10886
Sequences have been sorted

Approximated minimal memory consumption:
Sequence        : 0M
Buffer          : 1 X 10M = 10M
Table           : 1 X 65M = 65M
Miscellaneous   : 0M
Total           : 75M

Table limit with the given memory limit:
Max number of representatives: 1279296
Max number of word counting entries: 90502859


comparing sequences from          0  to         25

       25  finished         25  clusters

Apprixmated maximum memory consumption: 76M
writing new database
writing clustering information
program completed !

Total CPU time 0.12
Protein sequences clustered.Creating folders for each protein family...
Making a fasta file for each protein family...
Aligning sequences in each family, producing consensus sequences...
Making BLAST database for the family consensus sequences...
Making BLAST database for the reference protein sequences...
BLASTing the consensus family sequences against themselves...
Warning: [blastp] Number of threads was reduced to 64 to match the number of available CPUs
BLASTing the consensus family sequences against the reference protein sequences...
Warning: [blastp] Number of threads was reduced to 64 to match the number of available CPUs
Finding overlap with reference database...
Finding overlap with family consensus database...
Found True Markers...
No Quasi Markers needed...

Tmp markers saved to tmp929361619523380755/framecheck/FirstMarkers.faa

Processing complete! Final markers saved to ./markersforSpartina.fasta
Checking dependencies...
Checking to make sure that installed version of usearch can make databases...
Tested usearch. Appears to be working.
Treating input as a wgs file...
usearch v7.0.1090_i86linux32, 4.0Gb RAM (528Gb total), 64 cores
(C) Copyright 2013 Robert C. Edgar, all rights reserved.
http://drive5.com/usearch

Licensed to: hgruber@mpi-bremen.de

00:00  19Mb Reading input
00:00  22Mb    1.0% Masking
00:00  22Mb  100.0% Masking
00:00  35Mb    1.0% Word stats
00:00  35Mb  100.0% Word stats
00:00  73Mb    0.0% Building slots
00:01  73Mb   28.7% Building slots
00:02  73Mb   81.5% Building slots
00:02  73Mb  100.0% Building slots
00:02  60Mb    1.0% Build index   
00:02  64Mb  100.0% Build index
00:02  64Mb    0.0% Rows       
00:02  64Mb  100.0% Rows
00:02  64Mb Buffers     
00:02  80Mb    1.0% Seqs
00:02  80Mb  100.0% Seqs
00:02  64Mb 100.0% completed, split 1 (97 seqs)
00:02  64Mb Total 1 splits, 97 seqs

List of files in WGS set:./fastqjoinMSAP1.fasta

List of files in WGS set (after unpacking tarfiles):./fastqjoinMSAP1.fasta 

Working on file 1 of 1
Tabulating results for each marker... 
Traceback (most recent call last):
  File "/opt/share/software/packages/shortbred-0.9.4/shortbred_quantify.py", line 584, in <module>
    dReadLength = float(args.iAvgReadBP), iWGSReads = iTotalReadCount, strCentCheck=args.strCentroids,dAlnLength=args.dAlnLength,strFile = strInputFile)
  File "/opt/share/software/packages/shortbred-0.9.4/src/quantify_functions.py", line 544, in CalculateCounts
    sys.stderr.write("WARNING: 0 Reads found in file:" + strFile )
TypeError: cannot concatenate 'str' and 'list' objects
DONE!

Any idea what’s going on here?

Thank you,
Eileen
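Reading the traceback above: the crash happens on the line that writes the "WARNING: 0 Reads found in file:" message, where a string is concatenated with `strFile`, which according to the TypeError is a list. So the exception itself appears to come from the warning code, while the underlying condition is that zero reads were counted for the input. The sketch below shows one way such a message could be built safely; `warn_zero_reads` is a hypothetical function for illustration and is not ShortBRED's actual code.

```python
import sys


def warn_zero_reads(input_files):
    """Write the zero-reads warning for a single path or a list of paths.

    Hypothetical helper (not ShortBRED code). Joining list entries avoids
    the 'cannot concatenate str and list' TypeError seen in the traceback.
    """
    if isinstance(input_files, (list, tuple)):
        names = ", ".join(str(name) for name in input_files)
    else:
        names = str(input_files)
    message = "WARNING: 0 Reads found in file: " + names + "\n"
    sys.stderr.write(message)
    return message
```

Even with the message fixed, the run would still report zero reads, so the converted FASTA file is worth inspecting first (for example, whether it is empty or still in FASTQ format).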
