Question about ribosomal_RNA database

Hello!
I downloaded the SILVA ribosomal RNA database from your website (downloaded directly because kneaddata_database failed for me), and I have a question. Looking at the index with bowtie2-inspect, it looks like the reference sequences contain only A/C/G and that the U’s present in the SILVA fasta files have been removed. Building a custom database using bowtie2-build with the downloaded FASTA files has the same problem (is it best to just convert the U’s to T’s in the SILVA fasta file?). My apologies if I’m missing something obvious - this is my first analysis with kneaddata, and my first analysis where I actually care about characterizing rRNA in an experiment.

In any case, thanks for developing this very useful tool!
Michael

PS - I actually posted this question at https://bitbucket.org/biobakery/kneaddata/issues, but I assume this is a better spot…

Hello Michael !
I looked at the “$KNEADDATA_DB_RIBOSOMAL_RNA” database, looked at the fasta records used to create the Bowtie2 index and I see lots of "U"s, so can you elaborate?
I am enclosing a screen print.
Let me know…
Best regards,
George Weingart PhD
Huttenhower Lab

Hi George,
Thanks for your response. I agree that the original FASTA file for SILVA contains many U’s. The bowtie2 distribution includes an executable “bowtie2-inspect” which takes a Bowtie2 index an prints out the underlying FASTA sequence of the index. Running it on the pre-built kneaddata index (SILVA_128_LSUParc_SSUParc_ribosomal_RNA.1.bt2l, etc.) yields the following:

% bowtie2-inspect SILVA_128_LSUParc_SSUParc_ribosomal_RNA | head -6

GCVF01000057.1.1978 Eukaryota;Amoebozoa;Thalassiosira rotula
CGAAAGACAACGAACCACAGAGCGGCCCCCGAAGCCCCAGGAAGCGGAACAAAAAAGACA
GGAAAGCGAAGAAGAAGCCGGGGAGAAACACCCGACACCAAACAAAAGGAAGAAGGAAAG
CAAAGAAACAGAAAACAAGCAGGGGCCAGGAAGCAGAACGGCGAGGGGAGAACCAAACGG
AAGAAAGGCCAAAAGACGCACAGAACCACAAAAGGGGACACAGACAGGGACGGGGCCAGG
AAGCGGAACCGCAAGGAGGGAACAACCACCAACCGAAGAAAGCCCGAAAAGGAGGCGCAA

You can see that this is precisely the start of the nucleotide sequence for GCVF01000057.1.1978, with all of the U’s omitted. So it isn’t clear to me that the index is correct for use in aligning our data. Perhaps it is a quirk of bowtie2-inspect, but it certainly seems odd.
Best,
Michael

Thanks for pointing this out, @helicam. We confirmed in a recent evaluation that bowtie2 is quietly dropping the U characters from the SILVA sequences during index construction rather than treating them as T-equivalents (an unusual choice relative to other nucleotide-level aligners). We’ve now corrected this issue by manually replacing the Us with Ts prior to indexing, which substantially improves the performance of the SILVA database.

The updated database file is available here. You can automatically update your local database using this command:

kneaddata_database --download ribosomal_RNA bowtie2 $DIR

I’ll pin a separate message about this to the KneadData board so that others users are aware of this important fix.

Sorry for re-opening this question.

Well, before indexing SILVA database, the Us have been replaced by Ts, as described in this link Creating SILVA ribosomal_RNA Database.

However, if the data is from metatranscriptomics, as we know RNA will be first converted to cDNA (may be strand-specific), and then be sequenced. Should the reference sequences in SILVA database be reverse complented first (AAUUCCGG→CCGGAATT) or still be replaced by Ts (AAUUCCGG → AATTCCGG)?

Thanks in advance!

Mort