The bioBakery help forum

Question about ribosomal_RNA database

Hello!
I downloaded the SILVA ribosomal RNA database from your website (downloaded directly because kneaddata_database failed for me), and I have a question. Looking at the index with bowtie2-inspect, it looks like the reference sequences contain only A/C/G and that the U’s present in the SILVA fasta files have been removed. Building a custom database using bowtie2-build with the downloaded FASTA files has the same problem (is it best to just convert the U’s to T’s in the SILVA fasta file?). My apologies if I’m missing something obvious - this is my first analysis with kneaddata, and my first analysis where I actually care about characterizing rRNA in an experiment.

In any case, thanks for developing this very useful tool!
Michael

PS - I actually posted this question at https://bitbucket.org/biobakery/kneaddata/issues, but I assume this is a better spot…

Hello Michael !
I looked at the “$KNEADDATA_DB_RIBOSOMAL_RNA” database, looked at the fasta records used to create the Bowtie2 index and I see lots of "U"s, so can you elaborate?
I am enclosing a screen print.
Let me know…
Best regards,
George Weingart PhD
Huttenhower Lab

Hi George,
Thanks for your response. I agree that the original FASTA file for SILVA contains many U’s. The bowtie2 distribution includes an executable “bowtie2-inspect” which takes a Bowtie2 index an prints out the underlying FASTA sequence of the index. Running it on the pre-built kneaddata index (SILVA_128_LSUParc_SSUParc_ribosomal_RNA.1.bt2l, etc.) yields the following:

% bowtie2-inspect SILVA_128_LSUParc_SSUParc_ribosomal_RNA | head -6

GCVF01000057.1.1978 Eukaryota;Amoebozoa;Thalassiosira rotula
CGAAAGACAACGAACCACAGAGCGGCCCCCGAAGCCCCAGGAAGCGGAACAAAAAAGACA
GGAAAGCGAAGAAGAAGCCGGGGAGAAACACCCGACACCAAACAAAAGGAAGAAGGAAAG
CAAAGAAACAGAAAACAAGCAGGGGCCAGGAAGCAGAACGGCGAGGGGAGAACCAAACGG
AAGAAAGGCCAAAAGACGCACAGAACCACAAAAGGGGACACAGACAGGGACGGGGCCAGG
AAGCGGAACCGCAAGGAGGGAACAACCACCAACCGAAGAAAGCCCGAAAAGGAGGCGCAA

You can see that this is precisely the start of the nucleotide sequence for GCVF01000057.1.1978, with all of the U’s omitted. So it isn’t clear to me that the index is correct for use in aligning our data. Perhaps it is a quirk of bowtie2-inspect, but it certainly seems odd.
Best,
Michael