All paired-end reads unmatched

Hi, this solution works for me, while changing the sequence labels does not. Can anyone point me to the documentation that explains the difference between the --decontaminate-pairs lenient option and the standard output?

Hi all,

Thank you for reaching out to the bioBakery Lab forum.

From KneadData >= v0.11.0, there are options --input1 and --input2, so if the read identifiers are the same in both pairs, no modification of the input files is necessary; KneadData will take care of it automatically.
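For illustration, a paired-end invocation with those options might look like this (all file and database paths below are placeholders, not from this thread):

```shell
# Hypothetical paths; --reference-db points at a Bowtie2-indexed host genome.
kneaddata --input1 sample_R1.fastq.gz \
          --input2 sample_R2.fastq.gz \
          --reference-db /path/to/host_db \
          --output kneaddata_out
```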

Regards,
Sagun

Hello,

I am using kneaddata v0.12.0 with the options --input1 and --input2 specified, and have encountered the same issue others have described. My sequence identifiers are @A00674:488:HGYKGDSX5:1:1101:3965:1000 1:N:0:ACGATCAG+GAGACGAT for R1, and @A00674:488:HGYKGDSX5:1:1101:3965:1000 2:N:0:ACGATCAG+GAGACGAT for R2.

In other words, the read identifiers appear to be the same in both pairs.
I used default parameters for bowtie2.
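As a quick sanity check (not from the thread, just a generic shell snippet; file names are placeholders), you can confirm that the identifiers really do match pair-by-pair:

```shell
# Extract the first whitespace-delimited field of each header line
# (every 1st line of a 4-line FASTQ record).
zcat R1.fastq.gz | awk 'NR % 4 == 1 {print $1}' > r1.ids
zcat R2.fastq.gz | awk 'NR % 4 == 1 {print $1}' > r2.ids

# Count pairs whose identifiers differ; 0 means the pairing is consistent.
paste r1.ids r2.ids | awk '$1 != $2 {bad++} END {print bad+0, "mismatched pairs"}'
```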


I am encountering the same issue that you described in your comment. Could you share the method you used to solve it?

Please refer to the following two links; they will guide you through editing kneaddata/utilities.py to fix the label issue.


Hi, KneadData v0.12.0 with the paired-end options still doesn't give correct output. All my host-removed contam.fastq files are empty, and all reads end up in the unmatched files. I tried changing the sequence identifiers, using an older version of KneadData, and --decontaminate-pairs lenient; none of them works. Are you still working on fixing the issue?

Hi @sagunmaharjann, @Ashma45 and @fquerdasi,

Same issue here: after running KneadData, all reads are unmatched and the paired aggregate files are empty.

Illumina sequence identifiers:
@VL00128:95:AAFFWM5M5:1:1101:18534:1000 1:N:0:CCGGTTCCTA+CTCGAATATA
@VL00128:95:AAFFWM5M5:1:1101:18534:1000 2:N:0:CCGGTTCCTA+CTCGAATATA
KneadData 0.12.0, used in paired mode (--input1 --input2).
I ran into the issue with two different datasets, but had no issue with the demo files (which have a different type of header).

I checked this proposed update to the utilities.py script.

Unfortunately it didn't help, as I couldn't find these lines in the utilities.py script of KneadData v0.12.0.

Were you able to solve the issue?
Any tips would be very much appreciated.

Many thanks,
Best wishes,

Laure

Hi Laure,

I dropped the idea of using KneadData for paired-end reads and took the longer route of using Trimmomatic for trimming, Bowtie2 to map reads against the host genome, and samtools to filter out the unmapped read pairs. I think this is basically what KneadData does in one go.
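For anyone considering the same route, here is a rough sketch of that manual pipeline. All file, adapter, and database names are placeholders, and the trimming parameters are only examples to be tuned to your data:

```shell
# 1. Quality/adapter trimming (Trimmomatic, paired-end mode)
trimmomatic PE R1.fastq.gz R2.fastq.gz \
    R1.trim.fastq.gz R1.unpaired.fastq.gz \
    R2.trim.fastq.gz R2.unpaired.fastq.gz \
    ILLUMINACLIP:adapters.fa:2:30:10 SLIDINGWINDOW:4:20 MINLEN:50

# 2. Map the trimmed pairs against the host genome with Bowtie2
bowtie2 -x host_db -1 R1.trim.fastq.gz -2 R2.trim.fastq.gz -S mapped.sam

# 3. Keep read pairs where neither mate mapped to the host
#    (-f 12: read unmapped AND mate unmapped; -F 256: drop secondary alignments)
samtools view -b -f 12 -F 256 mapped.sam \
| samtools fastq -1 clean_R1.fastq.gz -2 clean_R2.fastq.gz -
```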


I solved my issue by downgrading KneadData to v0.10.0. Hope this helps!
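For reference, the downgrade can be done with a single command, assuming the biobakery conda channel hosts that version:

```shell
# Pin KneadData to the older version mentioned above
conda install -c biobakery kneaddata=0.10.0
```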

The developers solved this on Nov 22, 2022:

https://github.com/biobakery/kneaddata/commit/ebc077999b7d020f666a11aec782119ab764e00f

But it seems they didn't make a new release of it on GitHub.

You can set up the environment with conda using the --only-deps parameter to install only the dependencies, and then git clone kneaddata.


Hi, I just came across the same issue.

When using KneadData, it appears that the software first outputs a SAM file and then processes this file to determine the mapping results. My understanding is that during this post-processing step, the software identifies paired reads by examining the suffix of each read’s name, looking for either ‘/1’ or ‘/2’ to differentiate between the two ends of a pair.

While this approach works seamlessly with raw sequencing data, I have found that when working with data obtained from public databases, the read names are often sanitized, and the distinguishing ‘/1’ or ‘/2’ suffixes are removed. This could potentially lead to misidentification of paired reads during the post-processing phase.

Bowtie2 actually offers built-in options to handle such cases elegantly. The --un-conc and --un parameters of bowtie2 are specifically designed to output unmapped reads in a way that retains the paired-end information, even when the read names have been altered or lack these suffixes.

Could you maybe include an option for users to enable bowtie2’s --un-conc and --un parameters during the mapping process? This would allow for better handling of paired-end reads with modified names.
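For illustration, a direct bowtie2 run using that option might look like this (host_db and the read file names are placeholders):

```shell
# --un-conc-gz writes read pairs that do NOT align concordantly to the
# host index, keeping R1/R2 in sync regardless of header suffixes;
# bowtie2 replaces the % in the template with 1 and 2.
bowtie2 -x host_db -1 R1.fastq.gz -2 R2.fastq.gz \
        --un-conc-gz clean_R%.fastq.gz -S /dev/null
```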

Hey bro,

Could you please specify the details of your solution?

Thanks for your instruction, bro.

Here I post my solution.

Step One

Create a conda environment for kneaddata.

conda create -n kneaddata

conda activate kneaddata

Step Two

Install python for kneaddata environment.

conda install -n kneaddata python=3.10

Step Three

Clone the latest kneaddata program.

git clone https://github.com/biobakery/kneaddata.git

Step Four

Modify some code lines in the script.

cd kneaddata

Open setup.py in an editor such as vim, nano, or Visual Studio Code, and find the function install_trf(final_install_folder, mac_os) (around line 240).

The two URLs are no longer available, so we need to replace them manually:

def install_trf(final_install_folder, mac_os):
    """ Download and install trf """

    trf_exe = "trf"

    if mac_os:
        url = "https://github.com/Benson-Genomics-Lab/TRF/releases/download/v4.09.1/trf409.macosx"
    else:
        url = "https://github.com/Benson-Genomics-Lab/TRF/releases/download/v4.09.1/trf409.linux64"

Then, save the changes.

Step Five

Please ensure that you’re in the kneaddata conda environment before running the setup.py script.

python setup.py install

Downloading the dependencies may take a long time for some users. You can replace the URLs (GitHub, SourceForge) with mirror-site URLs.

After the dependencies are installed, you can run the KneadData pipeline on your metagenomic data successfully!


If anyone is still facing this issue and doesn't want to modify the utilities.py file, you can use the following shell one-liner as a workaround.

The issue arises because the following code block, which was present in version 0.10.0, is missing from utilities.py in version 0.12.0:

with open(new_file, "wt") as file_handle:
    for lines in read_file_n_lines(file,4):
        # Reformat the identifier and write to the temp file
        if " 1:" in lines[0]:
            lines[0] = lines[0].replace(" 1", "").rstrip() + "#0/1\n"
        elif " 2:" in lines[0]:
            lines[0] = lines[0].replace(" 2", "").rstrip() + "#0/2\n"
        elif " " in lines[0]:
            lines[0] = lines[0].replace(" ", "")
        file_handle.write("".join(lines))

# Add the new file to the list of temp files
update_temp_output_files(temp_file_list, new_file, all_input_files)

return new_file

I’m not sure why this was removed, but because of this, certain headers cannot be handled, such as:

@VH00481:160:AAFLVL2M5:1:1101:65532:1114 1:N:0:GCATGT
@VH00481:160:AAFLVL2M5:1:1101:65532:1114 2:N:0:GCATGT

It seems that since the paired-end information comes after a space (rather than at the end), some programs might not correctly recognize the data as paired.

Workaround Using awk

You can use awk to modify the header of a gzipped FASTQ file. Here’s a solution that replaces " 1" with "" and appends "#0/1" to the header:

For R1:

zcat your_fastq_file.gz | awk 'NR%4==1 {gsub(/ 1/, ""); print $0"#0/1"; next} {print}' | gzip > modified_fastq_file.gz

For R2:

zcat your_fastq_file.gz | awk 'NR%4==1 {gsub(/ 2/, ""); print $0"#0/2"; next} {print}' | gzip > modified_fastq_file.gz

Explanation:

  • zcat your_fastq_file.gz: Unzips the gzipped FASTQ file.
  • awk 'NR%4==1 { ... }': Modifies only the header lines (every 1st line of a 4-line block).
    • gsub(/ 1/, ""): Removes " 1" from the header.
    • print $0"#0/1": Appends "#0/1" to the end of the header for R1, similarly for R2.
  • next: Skips to the next line after modifying the header.
  • {print}: Prints the non-header lines (sequence, plus sign, and quality).
  • gzip > modified_fastq_file.gz: Compresses the output back into a gzipped file.
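
The rewrite can be checked on a single in-line record without touching any files (same awk program as above, minus the gzip steps):

```shell
# Feed one FASTQ record from the thread through the R1 header rewrite;
# the header becomes @VH00481:160:AAFLVL2M5:1:1101:65532:1114:N:0:GCATGT#0/1
printf '@VH00481:160:AAFLVL2M5:1:1101:65532:1114 1:N:0:GCATGT\nACGT\n+\nFFFF\n' \
| awk 'NR%4==1 {gsub(/ 1/, ""); print $0"#0/1"; next} {print}'
```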

Overwriting the Original File:

If you want to overwrite the original file, you can use:

zcat your_fastq_file.gz | awk 'NR%4==1 {gsub(/ 1/, ""); print $0"#0/1"; next} {print}' | gzip > temp_file.gz && mv temp_file.gz your_fastq_file.gz

This ensures that only the headers are modified (with all information still present) while the rest of the FASTQ file remains intact.

Once done, everything should work as expected.

Best regards,
Anupam


Here is the installation command based on @EacoChen’s comment, which resolved this issue in my case:
Note: I installed Trimmomatic, TRF, Bowtie2, and FastQC myself, and the KneadData commit I am using is 90c05f3cd25a8c74a4d804f27dd0ce52718007d8.

git clone https://github.com/biobakery/kneaddata \
    && cd kneaddata \
    && git reset --hard ebc077999b7d020f666a11aec782119ab764e00f \
    && python3 setup.py install --bypass-dependencies-install