I’m asking because I’ve been getting an error with kneaddata v0.12.0 where bowtie2 does not recognize that my data are paired-end and dumps all of the reads into the “_unmatched” output files. I have confirmed that my reads appear correct, and have heard from multiple other users that they are having this same issue (which has also been brought up in other forum posts, including recently).
Thus, I was wondering if this issue has been investigated and potentially addressed in the seemingly new release?
If not, I am not sure how to fix it besides downgrading to kneaddata v0.10.0, which I would prefer to avoid if possible.
Hi @fquerdasi, yes, we have released kneaddata v0.12.0 on PyPI. It should recognize and resolve the issues with paired-end reads that have spaces or truncation in the read identifiers. Please try it out and let us know how it goes. If you have already tried it from bioconda (sorry, we don’t maintain that channel), feel free to post any issues here so we can follow up with bioconda.
Thanks for your reply. I did already install it with bioconda (as part of biobakery workflows v3.1) and it is with that version that I am reporting the issue:
When using kneaddata v0.12.0 with the options --input1 and --input2 specified, bowtie2 does not seem to realize that my files are paired-end reads rather than unpaired, and saves all the reads in “_unmatched”. In contrast, trimmomatic seems to work as expected. My sequence identifiers are @A00674:488:HGYKGDSX5:1:1101:3965:1000 1:N:0:ACGATCAG+GAGACGAT for R1, and @A00674:488:HGYKGDSX5:1:1101:3965:1000 2:N:0:ACGATCAG+GAGACGAT for R2.
In other words, the read identifiers are identical in both mates up to the first space; only the mate number after the space (1 vs. 2) differs.
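For anyone who wants to sanity-check their own headers the same way, here is a minimal Python sketch of the comparison described above. It only illustrates that the two identifiers share the same name before the first space. The helper names pair_key and mate_number are made up for this post, and this is not how kneaddata or bowtie2 match pairs internally.

```python
def pair_key(header: str) -> str:
    """Return the part of a FASTQ header before the first whitespace."""
    return header.strip().split()[0]


def mate_number(header: str):
    """Return 1 or 2 if the comment field starts with '1:' or '2:', else None."""
    parts = header.strip().split()
    if len(parts) > 1 and parts[1][:2] in ("1:", "2:"):
        return int(parts[1][0])
    return None


# The two identifiers quoted in this post.
r1 = "@A00674:488:HGYKGDSX5:1:1101:3965:1000 1:N:0:ACGATCAG+GAGACGAT"
r2 = "@A00674:488:HGYKGDSX5:1:1101:3965:1000 2:N:0:ACGATCAG+GAGACGAT"

same_name = pair_key(r1) == pair_key(r2)
print(same_name, mate_number(r1), mate_number(r2))
```

If the names match and the mate numbers come out as 1 and 2, the files themselves look like a normal Illumina pair, which suggests the problem lies in how the headers are handled downstream.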
I used default parameters for bowtie2.
The exact command I used is:
kneaddata --input1 $1 --input2 $2 --output <directory_name> --threads 16 --processes 16 --remove-intermediate-output --cat-final-output --reference-db /u/local/apps/BIOBAKERY/biobakery_workflows_databases/kneaddata_db_human_genome --serial
Have you found the solution to the issue now? I am encountering the same issue and can’t fix it.
Can I go ahead with HUMAnN microbial profiling with concatenated unmatched reads as input, considering HUMAnN uses a single input and doesn’t count paired reads?
@lauren.j.mciver: I went ahead and installed kneaddata with pip (my understanding is that doing so installs the PyPI version), and compared the code of the function get_reformatted_identifiers (in the utilities.py script) to the kneaddata version I installed with conda as part of biobakery workflows v3.1. They were identical.
I suspect the issue may be with bowtie2. The biobakery_workflows default bowtie2 version installed is 2.2.3. I’ve seen other posts on this forum from people with similar issues who have solved it by changing the bowtie2 version to 2.4.1, but that did not solve the issue for me.
The only fix that worked for me was downgrading kneaddata from v0.12.0 to v0.10.0. Other proposed fixes, including changing the read headers and the bowtie2 version, did not work for me.
I hope that the developers address this issue in v0.12.0, but until then, downgrading to v0.10.0 may be helpful.
I am also having the same problem. I’ve tried kneaddata 0.12.0 as well as 0.10.0, both from PyPI, and neither is able to match the headers. I edited the headers in one file with sed (as explained in a few posts) and that fixed it. So it seems not to recognize the format " 1:" and " 2:", but it’s not ideal to have to unzip, edit, and zip so many huge files… If anyone knows how to get kneaddata to natively recognize this standard format it would be great!
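In case it helps with the unzip/edit/zip step, below is a rough Python sketch that rewrites the headers while streaming from one gzipped file to another, so nothing is decompressed to disk. Two assumptions to flag: the target format (dropping the " 1:N:0:..." comment and appending "/1" or "/2") is just one convention I am assuming here, not something confirmed in this thread as what kneaddata expects, and the code assumes plain 4-line FASTQ records. The function names are made up for this example.

```python
import gzip


def rewrite_header(line: str, mate: int) -> str:
    """Turn '@name 1:N:0:INDEX' into '@name/1' (assumed target format)."""
    name = line.rstrip("\n").split(None, 1)[0]
    suffix = f"/{mate}"
    # Leave the name alone if it already carries the mate suffix.
    if not name.endswith(suffix):
        name += suffix
    return name + "\n"


def rewrite_fastq_gz(src: str, dst: str, mate: int) -> None:
    """Stream src to dst, rewriting only the header line of each record."""
    with gzip.open(src, "rt") as fin, gzip.open(dst, "wt") as fout:
        for i, line in enumerate(fin):
            # Assumes 4-line FASTQ records: the header is every 4th line.
            if i % 4 == 0:
                fout.write(rewrite_header(line, mate))
            else:
                fout.write(line)
```

Usage would be something like rewrite_fastq_gz("sample_R1.fastq.gz", "sample_R1.fixed.fastq.gz", 1) and the same with mate 2 for R2 (file names are placeholders). It writes to a new file rather than editing in place, so the originals stay untouched if the rewritten headers turn out not to help.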