Hello!
I downloaded the SILVA ribosomal RNA database from your website (downloaded directly because kneaddata_database failed for me), and I have a question. Looking at the index with bowtie2-inspect, it looks like the reference sequences contain only A/C/G and that the U’s present in the SILVA fasta files have been removed. Building a custom database using bowtie2-build with the downloaded FASTA files has the same problem (is it best to just convert the U’s to T’s in the SILVA fasta file?). My apologies if I’m missing something obvious - this is my first analysis with kneaddata, and my first analysis where I actually care about characterizing rRNA in an experiment.
In any case, thanks for developing this very useful tool!
Michael
Hello Michael !
I looked at the “$KNEADDATA_DB_RIBOSOMAL_RNA” database, looked at the fasta records used to create the Bowtie2 index and I see lots of "U"s, so can you elaborate?
I am enclosing a screen print.
Let me know…
Best regards,
George Weingart PhD
Huttenhower Lab
Hi George,
Thanks for your response. I agree that the original FASTA file for SILVA contains many U’s. The bowtie2 distribution includes an executable “bowtie2-inspect” which takes a Bowtie2 index an prints out the underlying FASTA sequence of the index. Running it on the pre-built kneaddata index (SILVA_128_LSUParc_SSUParc_ribosomal_RNA.1.bt2l, etc.) yields the following:
% bowtie2-inspect SILVA_128_LSUParc_SSUParc_ribosomal_RNA | head -6
You can see that this is precisely the start of the nucleotide sequence for GCVF01000057.1.1978, with all of the U’s omitted. So it isn’t clear to me that the index is correct for use in aligning our data. Perhaps it is a quirk of bowtie2-inspect, but it certainly seems odd.
Best,
Michael
Thanks for pointing this out, @helicam. We confirmed in a recent evaluation that bowtie2 is quietly dropping the U characters from the SILVA sequences during index construction rather than treating them as T-equivalents (an unusual choice relative to other nucleotide-level aligners). We’ve now corrected this issue by manually replacing the Us with Ts prior to indexing, which substantially improves the performance of the SILVA database.
The updated database file is available here. You can automatically update your local database using this command:
However, if the data is from metatranscriptomics, as we know RNA will be first converted to cDNA (may be strand-specific), and then be sequenced. Should the reference sequences in SILVA database be reverse complented first (AAUUCCGG→CCGGAATT) or still be replaced by Ts (AAUUCCGG → AATTCCGG)?