All paired-end read unmatched

If anyone is still facing this issue and doesn’t want to modify the utility.py file, you can use the following shell one-liner as a workaround.

The issue arises because the following code block, which was present in version 0.10.0, is missing from the utility.py file in version 0.12.0:

with open(new_file, "wt") as file_handle:
    for lines in read_file_n_lines(file,4):
        # Reformat the identifier and write to the temp file
        if " 1:" in lines[0]:
            lines[0] = lines[0].replace(" 1", "").rstrip() + "#0/1\n"
        elif " 2:" in lines[0]:
            lines[0] = lines[0].replace(" 2", "").rstrip() + "#0/2\n"
        elif " " in lines[0]:
            lines[0] = lines[0].replace(" ", "")
        file_handle.write("".join(lines))

# Add the new file to the list of temp files
update_temp_output_files(temp_file_list, new_file, all_input_files)

return new_file

I’m not sure why it was removed, but without it, headers like the following are no longer handled:

@VH00481:160:AAFLVL2M5:1:1101:65532:1114 1:N:0:GCATGT
@VH00481:160:AAFLVL2M5:1:1101:65532:1114 2:N:0:GCATGT

It seems that because the mate number (1 or 2) comes after a space in the middle of the header, rather than as a suffix at the end of the read ID, some programs might not recognize the reads as paired.
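
To make the effect on the identifiers concrete, here is a minimal Python sketch of the same rewrite applied to a single header line. The function name reformat_header is hypothetical (it is not part of utility.py); the logic simply mirrors the 0.10.0 block quoted above.

```python
def reformat_header(header: str) -> str:
    """Mimic the 0.10.0 identifier rewrite for one FASTQ header line.

    Hypothetical helper for illustration only; not part of utility.py.
    """
    if " 1:" in header:
        # Drop " 1" and mark the read as mate 1
        return header.replace(" 1", "").rstrip() + "#0/1\n"
    if " 2:" in header:
        # Drop " 2" and mark the read as mate 2
        return header.replace(" 2", "").rstrip() + "#0/2\n"
    if " " in header:
        # No mate information found: just remove the spaces
        return header.replace(" ", "")
    return header


r1 = "@VH00481:160:AAFLVL2M5:1:1101:65532:1114 1:N:0:GCATGT\n"
r2 = "@VH00481:160:AAFLVL2M5:1:1101:65532:1114 2:N:0:GCATGT\n"
print(reformat_header(r1), end="")
print(reformat_header(r2), end="")
```

After the rewrite, both mates share the same identifier up to the final "#0/1" or "#0/2", which is the form the older version produced.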

Workaround Using awk

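You can use awk to modify the headers of a gzipped FASTQ file. The commands below remove " 1" (or " 2") from each header and append "#0/1" (or "#0/2"). Run each command on the corresponding file, and use different input and output file names for R1 and R2 so that one result does not overwrite the other: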

For R1:

zcat your_fastq_file.gz | awk 'NR%4==1 {gsub(/ 1/, ""); print $0"#0/1"; next} {print}' | gzip > modified_fastq_file.gz

For R2:

zcat your_fastq_file.gz | awk 'NR%4==1 {gsub(/ 2/, ""); print $0"#0/2"; next} {print}' | gzip > modified_fastq_file.gz

Explanation:

  • zcat your_fastq_file.gz: Decompresses the gzipped FASTQ file to standard output (on macOS you may need gzcat or gzip -dc instead).
  • awk 'NR%4==1 { ... }': Modifies only the header lines (the first line of each 4-line record).
    • gsub(/ 1/, ""): Removes every occurrence of " 1" from the header.
    • print $0"#0/1": Prints the header with "#0/1" appended for R1 (and "#0/2" for R2).
    • next: Skips to the next input line after the header has been printed.
  • {print}: Prints the non-header lines (sequence, plus sign, and quality) unchanged.
  • gzip > modified_fastq_file.gz: Compresses the output back into a gzipped file.
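
If awk is not available, the same steps can be sketched in Python with the standard gzip module. This is only an illustration under the same assumptions as the awk command: the function name rewrite_headers and its parameters are my own, and like the awk version it rewrites every header unconditionally.

```python
import gzip


def rewrite_headers(src: str, dst: str, mate: str) -> None:
    """Rewrite every FASTQ header in a gzipped file (hypothetical helper).

    mate is "1" for R1 or "2" for R2. As in the awk command, every
    " <mate>" is removed from the header and "#0/<mate>" is appended.
    """
    with gzip.open(src, "rt") as fin, gzip.open(dst, "wt") as fout:
        for index, line in enumerate(fin):
            if index % 4 == 0:  # header: first line of each 4-line record
                header = line.rstrip("\n").replace(" " + mate, "")
                line = header + "#0/" + mate + "\n"
            # Sequence, plus sign, and quality lines pass through unchanged
            fout.write(line)
```

For example, rewrite_headers("your_fastq_file.gz", "modified_fastq_file.gz", "1") would correspond to the R1 command above.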

Overwriting the Original File:

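If you want to overwrite the original file, do not redirect the output straight back to it. Write to a temporary file first and rename it only if the compression step succeeds: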

zcat your_fastq_file.gz | awk 'NR%4==1 {gsub(/ 1/, ""); print $0"#0/1"; next} {print}' | gzip > temp_file.gz && mv temp_file.gz your_fastq_file.gz

This ensures that only the headers are modified (with all information still present) while the rest of the FASTQ file remains intact.

Once done, everything should work as expected.

Best regards,
Anupam
