All paired-end reads unmatched

Hi, this solution works for me, while changing the sequence labels does not. Can anyone point me to the documentation that explains the difference between the --decontaminate-pairs lenient option and the standard output?

Hi all,

Thank you for reaching out to the bioBakery Lab forum.

From KneadData >= v0.11.0, there are options --input1 and --input2, so if the read identifiers are the same in both pairs, no modification of the input files is necessary; KneadData will take care of it automatically.
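For illustration, a paired-end invocation with those options might look like this (all file and database paths below are placeholders, not from this thread):

```shell
# Hypothetical paths; --reference-db points at a Bowtie2-indexed host genome.
kneaddata --input1 sample_R1.fastq.gz \
          --input2 sample_R2.fastq.gz \
          --reference-db /path/to/host_db \
          --output kneaddata_out
```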

Regards,
Sagun

Hello,

I am using kneaddata v0.12.0 with the options --input1 and --input2 specified, and have encountered the same issue others have described. My sequence identifiers are @A00674:488:HGYKGDSX5:1:1101:3965:1000 1:N:0:ACGATCAG+GAGACGAT for R1, and @A00674:488:HGYKGDSX5:1:1101:3965:1000 2:N:0:ACGATCAG+GAGACGAT for R2.

In other words, the read identifiers appear to be the same in both pairs.
I used default parameters for bowtie2.
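As a quick sanity check (not from the thread, just a generic shell snippet; file names are placeholders), you can confirm that the identifiers really do match pair-by-pair:

```shell
# Extract the first whitespace-delimited field of each header line
# (every 1st line of a 4-line FASTQ record).
zcat R1.fastq.gz | awk 'NR % 4 == 1 {print $1}' > r1.ids
zcat R2.fastq.gz | awk 'NR % 4 == 1 {print $1}' > r2.ids

# Count pairs whose identifiers differ; 0 means the pairing is consistent.
paste r1.ids r2.ids | awk '$1 != $2 {bad++} END {print bad+0, "mismatched pairs"}'
```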


I am encountering the same issue that you described in your comment. Could you share the method you used to solve it?

Please refer to the following two links; they will guide you through editing kneaddata/utilities.py to fix the label issue.


Hi, KneadData v0.12.0 with the paired-end options still doesn't give correct output. All my host-removed contam.fastq files are empty, and all reads end up in the unmatched files. I tried changing the sequence identifiers, using an older version of KneadData, and --decontaminate-pairs lenient; none of them works. Are you still working on fixing the issue?

Hi @sagunmaharjann, @Ashma45 and @fquerdasi,

Same issue here: after running KneadData, all reads are unmatched and the paired aggregate files are empty.

Illumina sequence identifiers:
@VL00128:95:AAFFWM5M5:1:1101:18534:1000 1:N:0:CCGGTTCCTA+CTCGAATATA
@VL00128:95:AAFFWM5M5:1:1101:18534:1000 2:N:0:CCGGTTCCTA+CTCGAATATA
KneadData 0.12.0, used in paired mode (--input1 --input2).
I ran into the issue with two different datasets, but had no issue with the demo files (which have a different type of header).

I checked this proposed update to the utilities.py script.

Unfortunately it didn't help, as I couldn't find these lines in the utilities.py script of KneadData v0.12.0.

Were you able to solve the issue?
Any tips would be very much appreciated.

Many thanks,
Best wishes,

Laure

Hi Laure,

I dropped the idea of using KneadData for paired-end reads and took the longer route of using Trimmomatic for trimming, Bowtie2 to map reads against the host genome, and samtools to filter out the unmapped read pairs. I think this is basically what KneadData does in one go.
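For anyone considering the same route, here is a rough sketch of that manual pipeline. All file, adapter, and database names are placeholders, and the trimming parameters are only examples to be tuned to your data:

```shell
# 1. Quality/adapter trimming (Trimmomatic, paired-end mode)
trimmomatic PE R1.fastq.gz R2.fastq.gz \
    R1.trim.fastq.gz R1.unpaired.fastq.gz \
    R2.trim.fastq.gz R2.unpaired.fastq.gz \
    ILLUMINACLIP:adapters.fa:2:30:10 SLIDINGWINDOW:4:20 MINLEN:50

# 2. Map the trimmed pairs against the host genome with Bowtie2
bowtie2 -x host_db -1 R1.trim.fastq.gz -2 R2.trim.fastq.gz -S mapped.sam

# 3. Keep read pairs where neither mate mapped to the host
#    (-f 12: read unmapped AND mate unmapped; -F 256: drop secondary alignments)
samtools view -b -f 12 -F 256 mapped.sam \
| samtools fastq -1 clean_R1.fastq.gz -2 clean_R2.fastq.gz -
```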


I solved my issue by downgrading KneadData to v0.10.0. Hope this helps!
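For reference, the downgrade can be done with a single command, assuming the biobakery conda channel hosts that version:

```shell
# Pin KneadData to the older version mentioned above
conda install -c biobakery kneaddata=0.10.0
```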

The developers solved this on Nov 22, 2022:

https://github.com/biobakery/kneaddata/commit/ebc077999b7d020f666a11aec782119ab764e00f

But it seems they didn't make a new release of it on GitHub.

You can set up the environment with conda using the --only-deps parameter to install only the dependencies, and then git clone kneaddata.


Hi, I just came across the same issue.

When using KneadData, it appears that the software first outputs a SAM file and then processes this file to determine the mapping results. My understanding is that during this post-processing step, the software identifies paired reads by examining the suffix of each read’s name, looking for either ‘/1’ or ‘/2’ to differentiate between the two ends of a pair.

While this approach works seamlessly with raw sequencing data, I have found that when working with data obtained from public databases, the read names are often sanitized, and the distinguishing ‘/1’ or ‘/2’ suffixes are removed. This could potentially lead to misidentification of paired reads during the post-processing phase.

Bowtie2 actually offers built-in options to handle such cases elegantly. The --un-conc and --un parameters of bowtie2 are specifically designed to output unmapped reads in a way that retains the paired-end information, even when the read names have been altered or lack these suffixes.

Could you maybe include an option for users to enable bowtie2’s --un-conc and --un parameters during the mapping process? This would allow for better handling of paired-end reads with modified names.
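For illustration, a direct bowtie2 run using that option might look like this (host_db and the read file names are placeholders):

```shell
# --un-conc-gz writes read pairs that do NOT align concordantly to the
# host index, keeping R1/R2 in sync regardless of header suffixes;
# bowtie2 replaces the % in the template with 1 and 2.
bowtie2 -x host_db -1 R1.fastq.gz -2 R2.fastq.gz \
        --un-conc-gz clean_R%.fastq.gz -S /dev/null
```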

Hey bro,

Could you please specify the details of your solution?

Thanks for your instruction, bro.

Here I post my solution.

Step One

Create a conda environment for kneaddata.

conda create -n kneaddata

conda activate kneaddata

Step Two

Install python for kneaddata environment.

conda install -n kneaddata python=3.10

Step Three

Clone the latest kneaddata program.

git clone https://github.com/biobakery/kneaddata.git

Step Four

Modify some code lines in the script.

cd kneaddata

Open setup.py in an editor such as vim, nano, or Visual Studio Code, and find the function install_trf(final_install_folder, mac_os) (around line 240).

The two URLs are no longer available, so we need to replace them manually:

def install_trf(final_install_folder, mac_os):
    """ Download and install trf """

    trf_exe = "trf"

    if mac_os:
        url = "https://github.com/Benson-Genomics-Lab/TRF/releases/download/v4.09.1/trf409.macosx"
    else:
        url = "https://github.com/Benson-Genomics-Lab/TRF/releases/download/v4.09.1/trf409.linux64"

Then, save the changes.

Step Five

Please ensure that you’re in the kneaddata conda environment before running the setup.py script.

python setup.py install

Downloading the dependencies may take a long time for some users. You can replace the URLs (GitHub, SourceForge) with mirror-site URLs.

After the dependencies are installed, you can run the KneadData pipeline on your metagenomic data successfully!


If anyone is still facing this issue and doesn't want to modify the utilities.py file, you can use the following shell one-liner as a workaround.

The issue arises because the following code block, which was present in version 0.10.0, is missing from utilities.py in version 0.12.0:

with open(new_file, "wt") as file_handle:
    for lines in read_file_n_lines(file,4):
        # Reformat the identifier and write to the temp file
        if " 1:" in lines[0]:
            lines[0] = lines[0].replace(" 1", "").rstrip() + "#0/1\n"
        elif " 2:" in lines[0]:
            lines[0] = lines[0].replace(" 2", "").rstrip() + "#0/2\n"
        elif " " in lines[0]:
            lines[0] = lines[0].replace(" ", "")
        file_handle.write("".join(lines))

# Add the new file to the list of temp files
update_temp_output_files(temp_file_list, new_file, all_input_files)

return new_file

I’m not sure why this was removed, but because of this, certain headers cannot be handled, such as:

@VH00481:160:AAFLVL2M5:1:1101:65532:1114 1:N:0:GCATGT
@VH00481:160:AAFLVL2M5:1:1101:65532:1114 2:N:0:GCATGT

It seems that since the paired-end information comes after a space (rather than at the end), some programs might not correctly recognize the data as paired.

Workaround Using awk

You can use awk to modify the header of a gzipped FASTQ file. Here’s a solution that replaces " 1" with "" and appends "#0/1" to the header:

For R1:

zcat your_fastq_file.gz | awk 'NR%4==1 {gsub(/ 1/, ""); print $0"#0/1"; next} {print}' | gzip > modified_fastq_file.gz

For R2:

zcat your_fastq_file.gz | awk 'NR%4==1 {gsub(/ 2/, ""); print $0"#0/2"; next} {print}' | gzip > modified_fastq_file.gz

Explanation:

  • zcat your_fastq_file.gz: Unzips the gzipped FASTQ file.
  • awk 'NR%4==1 { ... }': Modifies only the header lines (every 1st line of a 4-line block).
    • gsub(/ 1/, ""): Removes " 1" from the header.
    • print $0"#0/1": Appends "#0/1" to the end of the header for R1, similarly for R2.
  • next: Skips to the next line after modifying the header.
  • {print}: Prints the non-header lines (sequence, plus sign, and quality).
  • gzip > modified_fastq_file.gz: Compresses the output back into a gzipped file.
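
The rewrite can be checked on a single in-line record without touching any files (same awk program as above, minus the gzip steps):

```shell
# Feed one FASTQ record from the thread through the R1 header rewrite;
# the header becomes @VH00481:160:AAFLVL2M5:1:1101:65532:1114:N:0:GCATGT#0/1
printf '@VH00481:160:AAFLVL2M5:1:1101:65532:1114 1:N:0:GCATGT\nACGT\n+\nFFFF\n' \
| awk 'NR%4==1 {gsub(/ 1/, ""); print $0"#0/1"; next} {print}'
```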

Overwriting the Original File:

If you want to overwrite the original file, you can use:

zcat your_fastq_file.gz | awk 'NR%4==1 {gsub(/ 1/, ""); print $0"#0/1"; next} {print}' | gzip > temp_file.gz && mv temp_file.gz your_fastq_file.gz

This ensures that only the headers are modified (with all information still present) while the rest of the FASTQ file remains intact.

Once done, everything should work as expected.

Best regards,
Anupam


Here is the installation command based on @EacoChen’s comment, which resolved this issue in my case:
Note: I installed Trimmomatic, TRF, Bowtie2, and FastQC myself, and the KneadData commit I am using is 90c05f3cd25a8c74a4d804f27dd0ce52718007d8.

git clone https://github.com/biobakery/kneaddata \
    && cd kneaddata \
    && git reset --hard ebc077999b7d020f666a11aec782119ab764e00f \
    && python3 setup.py install --bypass-dependencies-install