All paired-end read unmatched

Anupam_Gautam · October 23, 2024, 9:19am

If anyone is still facing this issue and doesn’t want to modify the utility.py file, you can use the following shell one-liner as a workaround.

The issue arises because the following code block, which was present in version 0.10.0, is missing from the utility.py file in version 0.12.0:

with open(new_file, "wt") as file_handle:
    for lines in read_file_n_lines(file,4):
        # Reformat the identifier and write to the temp file
        if " 1:" in lines[0]:
            lines[0] = lines[0].replace(" 1", "").rstrip() + "#0/1\n"
        elif " 2:" in lines[0]:
            lines[0] = lines[0].replace(" 2", "").rstrip() + "#0/2\n"
        elif " " in lines[0]:
            lines[0] = lines[0].replace(" ", "")
        file_handle.write("".join(lines))

# Add the new file to the list of temp files
update_temp_output_files(temp_file_list, new_file, all_input_files)

return new_file

I’m not sure why this was removed, but without it, headers such as the following are no longer handled:

@VH00481:160:AAFLVL2M5:1:1101:65532:1114 1:N:0:GCATGT
@VH00481:160:AAFLVL2M5:1:1101:65532:1114 2:N:0:GCATGT

It seems that because the read number (1 or 2) comes after a space rather than at the very end of the identifier, some programs might not recognize the two reads as a pair.
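To make the effect of the removed block concrete, here is a minimal Python sketch that applies the same three branches to a single header string. The function name `reformat_header` is my own and not part of KneadData; the original code works on 4-line chunks read from a file rather than on one string.

```python
def reformat_header(header: str) -> str:
    """Apply the v0.10.0-style identifier rewrite to one FASTQ header line.

    Hypothetical helper for illustration only; not a KneadData function.
    """
    if " 1:" in header:
        # Drop the " 1" after the space and tag the read as mate 1
        return header.replace(" 1", "").rstrip() + "#0/1\n"
    if " 2:" in header:
        # Drop the " 2" after the space and tag the read as mate 2
        return header.replace(" 2", "").rstrip() + "#0/2\n"
    if " " in header:
        # No mate marker: just remove the spaces
        return header.replace(" ", "")
    return header


r1 = "@VH00481:160:AAFLVL2M5:1:1101:65532:1114 1:N:0:GCATGT\n"
r2 = "@VH00481:160:AAFLVL2M5:1:1101:65532:1114 2:N:0:GCATGT\n"
print(reformat_header(r1), end="")  # @VH00481:160:AAFLVL2M5:1:1101:65532:1114:N:0:GCATGT#0/1
print(reformat_header(r2), end="")  # @VH00481:160:AAFLVL2M5:1:1101:65532:1114:N:0:GCATGT#0/2
```

After the rewrite, both mates share the same identifier up to the trailing "/1" or "/2".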

Workaround Using `awk`

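You can use awk to rewrite the header lines of a gzipped FASTQ file. The commands below remove " 1" (or " 2") from each header and append "#0/1" (or "#0/2"), so the first example header above becomes @VH00481:160:AAFLVL2M5:1:1101:65532:1114:N:0:GCATGT#0/1.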

For R1:

zcat your_fastq_file.gz | awk 'NR%4==1 {gsub(/ 1/, ""); print $0"#0/1"; next} {print}' | gzip > modified_fastq_file.gz

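For R2 (use different input and output file names than for R1, so the two results don't overwrite each other):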

zcat your_fastq_file.gz | awk 'NR%4==1 {gsub(/ 2/, ""); print $0"#0/2"; next} {print}' | gzip > modified_fastq_file.gz

Explanation:

zcat your_fastq_file.gz: Decompresses the gzipped FASTQ file to standard output.
awk 'NR%4==1 { ... }': Modifies only the header lines (the first line of each 4-line record).
- gsub(/ 1/, ""): Removes every occurrence of " 1" from the header.
- print $0"#0/1": Appends "#0/1" to the end of the header for R1 ("#0/2" for R2). The suffix is added to every header line, whether or not " 1" was found.
- next: Skips the rest of the awk program for this line, so the header is not printed a second time.
{print}: Prints the non-header lines (sequence, plus sign, and quality) unchanged.
gzip > modified_fastq_file.gz: Compresses the output back into a gzipped file.
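If you would rather stay in Python than use the shell, the same steps can be sketched with the standard gzip module. This is only an illustration of the awk logic above, not KneadData code; the function name `tag_headers` and its parameters are hypothetical.

```python
import gzip


def tag_headers(in_path: str, out_path: str, mate: int) -> None:
    """Rewrite FASTQ headers the way the awk one-liner does.

    Hypothetical helper: removes " <mate>" from every header line and
    appends "#0/<mate>", leaving all other lines untouched.
    """
    with gzip.open(in_path, "rt") as src, gzip.open(out_path, "wt") as dst:
        for line_number, line in enumerate(src):
            if line_number % 4 == 0:  # first line of each 4-line record
                header = line.rstrip("\n").replace(f" {mate}", "")
                line = f"{header}#0/{mate}\n"
            dst.write(line)
```

For example, tag_headers("your_fastq_file.gz", "modified_fastq_file.gz", mate=1) corresponds to the R1 command. The file is streamed line by line, so it does not need to fit in memory.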

Overwriting the Original File:

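If you want to overwrite the original file, write to a temporary file first and move it into place only if the pipeline succeeds (shown for R1; swap in " 2" and "#0/2" for R2):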

zcat your_fastq_file.gz | awk 'NR%4==1 {gsub(/ 1/, ""); print $0"#0/1"; next} {print}' | gzip > temp_file.gz && mv temp_file.gz your_fastq_file.gz

With these commands only the header lines change: the rest of the header information is kept, and the sequence, plus sign, and quality lines remain intact.

Once done, everything should work as expected.
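Before rerunning, a quick sanity check can confirm that the two converted files now carry matching identifiers. The sketch below is my own suggestion, not part of KneadData: `read_ids` and `mates_match` are hypothetical helpers, and the check only looks at the first `limit` records of each file.

```python
import gzip
from itertools import islice


def read_ids(path, mate, limit):
    """Return up to `limit` read IDs from a gzipped FASTQ, with '#0/<mate>' stripped."""
    suffix = f"#0/{mate}"
    ids = []
    with gzip.open(path, "rt") as handle:
        # Every 4th line, starting with the first, is a header
        for header in islice(handle, 0, limit * 4, 4):
            header = header.rstrip("\n")
            ids.append(header[: -len(suffix)] if header.endswith(suffix) else header)
    return ids


def mates_match(r1_path, r2_path, limit=1000):
    """True if the first `limit` records of both files have the same IDs in the same order."""
    return read_ids(r1_path, 1, limit) == read_ids(r2_path, 2, limit)
```

Calling mates_match on the converted R1 and R2 files should return True; on the unconverted files it returns False, because the headers still differ in the " 1:" and " 2:" part.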

Best regards,
Anupam
