KneadData SILVA rRNA database error: link to database & direct downloads

Hello bioBakery forum,
I tried to use KneadData to remove rRNA from my metatranscriptome data and realized the rRNA was not being matched to the SILVA database I downloaded via the KneadData manual.

After a discussion with Curtis in the bioBakery course and some digging in the manual and tutorial, I found the following potential issues.

  1. I used kneaddata_database --download ribosomal_RNA bowtie2 $DIR to download the SILVA database; it was 11 GB and came as already-built bowtie2 indexes (files ending in .bt2l). However, the Us were not converted to Ts in the direct download. I went further, wanting to convert the Us myself, and realized the conversion has to be done as a first step, on the SILVA fasta files.

  2. I then found a link to the SILVA fasta file on the tutorial page, http://huttenhower.sph.harvard.edu/kneadData_databases/SILVA_128_LSUParc_SSUParc_ribosomal_RNA_v0.1.tar.gz. I tried to wget the link, but the page was not found.

  3. Then I went to the official SILVA website trying to find the database… there were many explanations but no direct download of the database (I may have missed it). For amplicon data I previously used wget -O "silva-138-99-515-806-nb-classifier.qza" https://data.qiime2.org/2020.6/common/silva-138-99-515-806-nb-classifier.qza, but that is not a fasta file.

I am wondering whether bioBakery could share an updated link to download the SILVA rRNA reference database, or direct me to a place to download it. Then I can convert the Us in the downloaded fasta, run bowtie2-build on it, and remove rRNA from my metatranscriptome data with KneadData; a sketch of the steps I am imagining is below. Alternatively, a direct download of an already bowtie2-built, U-converted SILVA rRNA database would be great too. :slight_smile:
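To make the request concrete, here is a rough sketch of the steps I have in mind; the fasta, index, and sample names are just placeholders:

```bash
# 1. Convert U -> T in sequence lines only (header lines start with '>')
sed '/^>/!y/Uu/Tt/' SILVA_rRNA.fasta > SILVA_rRNA_dna.fasta

# 2. Build the bowtie2 index from the converted fasta
bowtie2-build SILVA_rRNA_dna.fasta SILVA_rRNA_dna

# 3. Point KneadData at the new index to remove rRNA
kneaddata --input sample_R1.fastq.gz --input sample_R2.fastq.gz \
  --reference-db SILVA_rRNA_dna --output kneaddata_out
```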

I am also wondering about my results: they only show adapter-removed and human-transcriptome-removed reads. The human-genome-removed counts did not show up. Is that a problem with the bowtie2 build files, or will metatranscriptome reads simply not match the human genome (hg37dec_v0.1.1.bt2, etc.)?

Curtis recommended George and Sagun for this question in the course :blush: @george-weingart and @sagunmaharjann
Thank you so much!
Yike

Hi Yike,
I will look into it
Best regards
George


Hi @YikeShen ,
Thank you for reaching out to the bioBakery Lab and catching the inconsistency in our tutorial. The link to the v0.2 SILVA database has been updated in the tutorial (kneaddata · biobakery/biobakery Wiki · GitHub).

Thanks to @george-weingart, the version 0.2 SILVA database already has all the Us converted to Ts, so it is fine to use directly.

Please see the documentation here (kneaddata · biobakery/biobakery Wiki · GitHub) for details on the kneaddata_count_table script output. Both of the database indexes (hg37 + hg38) are merged in the workflow, and the result should be for the merged DB, but let me confirm that.

Regards,
Sagun


Hi Yike,
I built the new database: I took the 2020 SILVA_128_LSUParc_SSUParc merged fasta file and used back_transcribe() from the Bio.Seq module (Bio.Seq module — Biopython 1.75 documentation) to convert the U's to T's. That is what it is supposed to be. I will follow up with Lauren to confirm that is the file that was posted.
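For anyone following along, the conversion looks roughly like this; it is a minimal sketch assuming the merged fasta is named SILVA_merged.fasta (the real script and file names may differ):

```bash
# Minimal sketch of the U -> T conversion with Biopython's back_transcribe();
# the input/output file names here are illustrative only.
python - <<'PY'
from Bio import SeqIO

with open("SILVA_merged_dna.fasta", "w") as out:
    for rec in SeqIO.parse("SILVA_merged.fasta", "fasta"):
        rec.seq = rec.seq.back_transcribe()  # RNA -> DNA (U -> T)
        SeqIO.write(rec, out, "fasta")
PY
```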
Best regards,
George


Thank you both so much for the quick reply! I downloaded the SILVA database from the manual page around 10 days ago. I skimmed the file and did find Us; please see the screenshot below. I will download it again later today to see if all the Us are converted.

Hi @YikeShen,
The screenshot would be of the bowtie2 indexes (*.bt2); these files can only be read by the bowtie2 mapper. However, the fasta reference file used to generate the above database did have U(s) in it, which had to be converted to T(s).

What you are seeing in the screenshot is perfectly normal, and the database should be good to use if you are using the version 0.2 SILVA.
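If you want to double-check that what you opened is a binary index rather than text, the file utility will typically report it as raw data (the index name below is just an example):

```bash
# bowtie2 indexes are binary, which is why they look garbled in a text editor
file SILVA_128_LSUParc_SSUParc_ribosomal_RNA.1.bt2l
```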

Regards,
Sagun


Thank you! @sagunmaharjann
For question 3, I think I am confused because the output file is named differently in my run (not seq1_kneaddata_demo_db_bowtie2_paired_contam_1.fastq); it is specifically labelled decontaminated hg_refMrRNA. Also, it looks like only the human transcriptome database is executable from the download. I ran chmod u+x and made everything executable, but still only the decontaminated hg_refMrRNA file has output. I did put everything in one database folder.

If hg38 and hg37 are merged, how do I see and confirm they were merged and removed? I ran kneaddata_count_table and here is the output data. I don't know how to transpose and print two transposed lines side by side in columns in Unix, but the numbers correspond to the data.

I am also curious why orphan reads 1 and 2 are different; are they supposed to be the same?

Thank you so much!!
Best,
Yike

Thank you, Sagun! Do you know how I can download and use the bowtie2 mapper to view the bowtie2 index (*.bt2) files? I read the bowtie2 manual and there was no command for that…

Hello George,
Thank you! I just downloaded the v0.2 SILVA database from the updated link and noticed the files are already bowtie2-built; it is exactly the same as the one I downloaded before from the manual page, which KneadData can't align against to remove rRNA…

Best,
Yike

Hi Yike,
I talked with Sagun about the issue you raised.
Where exactly do you see that the U's have not been converted?
You should look only at the fasta file that originated the index, and from the screenshots, you are seeing only the bowtie2 indexes, not the originating fasta file. If you want to verify, you can use the bowtie2 utility to regenerate the fasta file from the indexes and then check whether there are U's in the sequences.
Let me know?
Best regards !
George Weingart PhD
Huttenhower Lab


Hello George,
I am having a hard time finding the bowtie2 utility to reconstruct fasta files from the bowtie2 indexes. Could you direct me to the command?
I opened the bowtie2-built files in vi (e.g. SILVA_128_LSUParc_SSUParc_ribosomal_RNA.1.bt2l) and saw Us. But as you and Sagun pointed out, I shouldn't use vi to view the bowtie2 files. I searched for the bowtie2 mapper you two mentioned and couldn't find one, so maybe the U I saw is an artifact.

But in any case, I put the SILVA, transcriptome, and human genome references into one kneaddata_db folder, and the rRNA was not removed from my results :(. When I QC the output file, it still has >90% duplicate reads.
Attached is a sequencing count after KneadData; it shows lots of duplication.

Thank you!
Best,
Yike

Hi Yike,

The bowtie2 command to rebuild the fasta sequences from the index is bowtie2-inspect.
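For example, using the index basename (the file name without the .1.bt2l suffix; the output name here is arbitrary):

```bash
# Rebuild the reference fasta from the index basename
bowtie2-inspect SILVA_128_LSUParc_SSUParc_ribosomal_RNA > silva_rebuilt.fasta

# Count sequence lines that still contain a U (headers excluded); expect 0
grep -v '^>' silva_rebuilt.fasta | grep -c 'U'
```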

Let me know if, after rebuilding the fasta sequence from the index, you still see U's.

Best regards,
George Weingart PhD
Huttenhower Lab

Hello George,
Thank you! I rebuilt and inspected the bowtie2 indexes. There were no Us. I am sorry; I mistook what I saw in vi on the .bt2l files for Us! But if that's not the issue, why was my rRNA not removed, and why do I still see very high sequence duplication rates? Does anybody have ideas? @george-weingart @sagunmaharjann
Thank you!
Best,
Yike

Here is the code block I used. It didn't return any error message, but I think it is not removing rRNA and human reads; only the human transcriptome bowtie2 database is matched and removed.
I am happy to have a quick call if anybody thinks that would be easier and quicker for solving this… Thank you!!

kneaddata --input /mnt/ls15/scratch/users/shenyike/metatranscriptome/fastqmerge/1091RNA_R1.fastq.gz \
    --input /mnt/ls15/scratch/users/shenyike/metatranscriptome/fastqmerge/1091RNA_R2.fastq.gz \
    --reference-db /mnt/ls15/scratch/users/shenyike/metatranscriptome/kneaddata_db \
    --output /mnt/ls15/scratch/users/shenyike/metatranscriptome/kneaddataOutputPairedEnd_1091 \
    --trimmomatic /opt/software/Trimmomatic/0.39-Java-1.8/ \
    --trimmomatic-options="ILLUMINACLIP:/opt/software/Trimmomatic/0.39-Java-1.8/adapters/NexteraPE-PE.fa:2:30:10"

Hi Yike,
I talked to Eric, and we think that what you need to do is the following:
Download the databases to different locations (human genome, human transcriptome, and rRNA (SILVA)),
and then run a single command that references all the databases.
There is an example of doing a call with multiple databases here:

If you’ll put the SILVA files in one folder, human genome in another, etc. then point to each of them with the --reference-db flag (as in the manual) everything should work nicely.


Hello George, Eric, and Sagun,
Thank you so much for your help. The QC was successful! A small suggestion for others running KneadData on metatranscriptome data: request a long computation time. One of my samples took 27 hours to run QC…
I am sorry, but I do have some more questions after looking at the QC results and the explanations from the manual…

  1. I think KneadData does the adapter trimming first, right? The trimmed files are the first ones listed in the log file. From the results, the adapter-trimmed reads are 120 bp, but why, when they pass through SILVA, do the reads come out as 150 bp?

  2. I am curious because I want to know: when we look at the effectiveness of rRNA removal, do we usually compute it per total reads or per adapter-removed reads?

  3. Our rRNA percentage ranges from 42%-82% of the reads after adapter removal; we have high adapter content (50%), so the rRNA content relative to total reads ranges from 25%-38%. Is this normal? And is such a large variation normal?

  4. After all the QC steps (adapter removal, host genome and transcriptome removal, rRNA removal), we still have 66%-88% sequence duplication in our QC data. Should I remove the duplicates for downstream analysis? The KneadData manual suggests doing so, but I am not sure whether I would lose overexpressed genes if I did.

  5. I saw that the Huttenhower group's 2018 Nature Microbiology paper (Abu-Ali et al., https://www.nature.com/articles/s41564-017-0084-4) removed samples with fewer than 1 million metatranscriptome reads. Maybe I missed the specification in the paper, but I am not sure whether this 1 million threshold applies before or after trimming off over-represented sequences. Some of our samples fall short of 1 million reads after QC and over-represented-sequence trimming, even with a sequencing depth of 70 million reads per end.
Answers are greatly appreciated! @george-weingart @franzosa @sagunmaharjann
Thank you!!
Yike