Kneaddata Reformatting file sequence identifiers ... Type error

Hi there, I’m currently trying to run Kneaddata on a high-performance computing cluster. I installed Kneaddata with pip into a virtual environment and downloaded the indexed human genome database.

Command:
kneaddata --input STJ-182-d5-709_S137_L001_R1_001.fastq.gz --input STJ-182-d5-709_S137_L001_R2_001.fastq.gz -db ~/human --output kneaddata_STJ-182_d5 --trimmomatic $PATH_to_Trimmomatic

I’m getting the error shown below. Any guidance anyone could offer to resolve this would be much appreciated.

Decompressing gzipped file …

Reformatting file sequence identifiers …

Traceback (most recent call last):
  File "/home/rach06/kneaddata/bin/kneaddata", line 8, in <module>
    sys.exit(main())
  File "/home/rach06/kneaddata/lib/python3.6/site-packages/kneaddata/knead_data.py", line 427, in main
    args.input[index]=utilities.get_reformatted_identifiers(args.input[index],args.output_dir, temp_output_files)
  File "/home/rach06/kneaddata/lib/python3.6/site-packages/kneaddata/utilities.py", line 258, in get_reformatted_identifiers
    os.write(file_out, "".join(lines))
TypeError: a bytes-like object is required, not 'str'

Hi @Rachael-16,

Apologies for the late reply. It looks like kneaddata is running into a problem while reformatting the sequence identifiers of the R1 and R2 files. Could you please tell me which version of kneaddata you are using and post the first 4 lines of each input file (STJ-182-d5-709_S137_L001_R1_001.fastq.gz and STJ-182-d5-709_S137_L001_R2_001.fastq.gz)?

Regards,
Sagun

This error occurs because Python 3 distinguishes between strings (str, which hold Unicode text) and bytes. os.write() writes raw bytes to a file descriptor, so it only accepts a bytes-like object, but in the traceback above it is passed the str produced by "".join(lines). The string therefore needs to be encoded to bytes before it is written. UTF-8 is the default encoding in Python 3, so you can use:

"python string to bytes".encode("utf-8")

The opposite conversion, from bytes to a string, uses the decode() method of the bytes class:

b"python byte to string".decode("utf-8")
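As a rough sketch of the failure in the traceback and one way around it, the snippet below passes a str to os.write(), which raises the same TypeError, and then encodes the joined lines first. The FASTQ record and the temporary file are made up for illustration; this is not kneaddata's actual code.

```python
import os
import tempfile

# Hypothetical FASTQ record, standing in for the lines kneaddata reformats.
lines = ["@read1\n", "ACGT\n", "+\n", "IIII\n"]

error_message = None
fd, path = tempfile.mkstemp()  # returns a raw file descriptor, as os.write() expects
try:
    try:
        # Same pattern as in the traceback: a str passed to os.write().
        os.write(fd, "".join(lines))
    except TypeError as err:
        error_message = str(err)

    # Encoding the joined text to bytes first lets the write succeed.
    written = os.write(fd, "".join(lines).encode("utf-8"))
finally:
    os.close(fd)

# Read the file back in binary mode to confirm what was written.
with open(path, "rb") as handle:
    content = handle.read()
os.remove(path)

print(error_message)
print(written, content)
```

If you are calling kneaddata rather than editing it, the point is only that the str-to-bytes step has to happen somewhere before the os.write() call.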

Python makes a clear distinction between bytes and strings. Bytes objects contain raw data (a sequence of octets), whereas strings are sequences of Unicode characters. Conversion between the two types is explicit: you encode a string to get bytes, specifying an encoding (which defaults to UTF-8), and you decode bytes to get a string. Code that calls these functions should be aware that such conversions can fail, and should decide how failures are handled.
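To illustrate how such conversions can fail, here is a small sketch using made-up byte strings. It shows a strict decode raising UnicodeDecodeError, the errors="replace" option as one possible fallback, and an encode to ASCII raising UnicodeEncodeError. Which fallback is appropriate depends on your data.

```python
valid = b"caf\xc3\xa9"   # UTF-8 encoding of "cafe" with an accented e
broken = b"caf\xe9"      # Latin-1 byte sequence, not valid UTF-8

# Decoding valid UTF-8 bytes gives back the text.
text = valid.decode("utf-8")

# Strict decoding (the default) raises on invalid input.
try:
    broken.decode("utf-8")
    strict_decode_failed = False
except UnicodeDecodeError:
    strict_decode_failed = True

# errors="replace" substitutes U+FFFD for bytes that cannot be decoded.
lenient = broken.decode("utf-8", errors="replace")

# Encoding can fail too, if the target encoding lacks a character.
try:
    text.encode("ascii")
    ascii_encode_failed = False
except UnicodeEncodeError:
    ascii_encode_failed = True

print(repr(text), strict_decode_failed, repr(lenient), ascii_encode_failed)
```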