The bioBakery help forum

Removing human transcripts with polyA from RNA data


I have some metatranscriptome samples with human contamination. I ran kneaddata with the human transcriptome as the decontamination database (--reference-db human_hg38_refMrna). After downstream processing with HUMAnN, I found that a large percentage of the reads were unaligned, so I looked at the first 30 or so reads and found that most of them have a long stretch of polyA at the end of the sequence. The front half of these sequences BLASTs to human. Is there a good way to filter out these sequences with kneaddata?
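For anyone who wants to quantify this rather than eyeball the first 30 reads, here is a minimal Python sketch that measures trailing polyA runs in FASTQ records and optionally trims them. It is not part of kneaddata; the function names and the 15-base threshold are assumptions chosen for illustration, and it only matches exact runs of A, so a tail interrupted by a sequencing error would be undercounted.

```python
import re

def polya_tail_length(seq):
    """Return the length of the run of A's at the 3' end of seq (0 if none)."""
    match = re.search(r"A+$", seq.upper())
    return len(match.group(0)) if match else 0

def trim_polya(seq, qual, min_tail=15):
    """Drop a trailing polyA run of at least min_tail bases (assumed threshold)
    from a read and its quality string; shorter runs are left untouched."""
    tail = polya_tail_length(seq)
    if tail >= min_tail:
        return seq[:-tail], qual[:-tail]
    return seq, qual

def fraction_with_polya(fastq_lines, min_tail=15):
    """Fraction of reads with a polyA tail, given an iterable of FASTQ lines
    laid out as 4 lines per record (header, sequence, '+', qualities)."""
    total = tailed = 0
    for i, line in enumerate(fastq_lines):
        if i % 4 == 1:  # the second line of each record is the sequence
            total += 1
            if polya_tail_length(line.strip()) >= min_tail:
                tailed += 1
    return tailed / total if total else 0.0
```

Passing an open file handle for an uncompressed FASTQ to fraction_with_polya gives a quick estimate of how widespread the tails are before deciding whether extra trimming is worth it.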


Hi @jbarlow ,

Thank you for reaching out to the bioBakery Lab. Can you please confirm that you are using KneadData's human_transcriptome reference database (downloaded with: kneaddata_database --download human_transcriptome bowtie2 $DIR)?

You could also try adding the BLAST hits to the database as contaminants and decoys (add them to the .faa file, then rebuild the index) to see whether that improves performance.
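A rough sketch of that workflow in Python, assuming you have the reference transcriptome as a FASTA file and the BLAST-hit sequences saved as a second FASTA. All file names and helper names here are hypothetical; bowtie2-build is the standard Bowtie 2 indexer, called as bowtie2-build <reference_in> <index_basename>.

```python
def build_decoy_fasta(reference_fasta, extra_fastas, combined_fasta):
    """Concatenate the reference FASTA and any extra contaminant/decoy FASTA
    files into one combined file, making sure every line ends with a newline
    so records from different files never run together."""
    with open(combined_fasta, "w") as out:
        for path in [reference_fasta, *extra_fastas]:
            with open(path) as src:
                for line in src:
                    out.write(line if line.endswith("\n") else line + "\n")
    return combined_fasta

def bowtie2_build_command(combined_fasta, index_prefix):
    """Return the bowtie2-build command as an argument list:
    input FASTA first, then the basename for the index files."""
    return ["bowtie2-build", str(combined_fasta), str(index_prefix)]
```

If bowtie2-build is on your PATH, you can run the returned command with subprocess.run(cmd, check=True) and then point kneaddata's --reference-db at the folder containing the new index.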


Hi @sagunmaharjann ,

Thanks for following up on this. I can confirm that I was using the human_transcriptome reference database from kneaddata. I ended up realizing that I wasn't doing adapter trimming correctly (I needed to change the default to TruSeq). I also updated to kneaddata 0.10 from pip, instead of 0.7.4 from conda, and after that I no longer had the issue. I'm not sure exactly which change fixed the problem, but all is good now!