The bioBakery help forum

Normalize ShortBRED Reads

How do we normalize hit numbers to the sample?
Is it reasonable to normalize hits to the number of reads in the FASTA file?

The “Count” column in the ShortBRED quantify output is in units of RPKMs, which are already adjusted for sequencing depth (number of reads in the sample). Hence you should be able to analyze those directly.

Thanks very much. I believe RPKM is best for comparing different types of marker within a sample, whereas Hits would work best across samples.
I’m wondering why Counts (RPKM) and Hits give such different results. Why would some of my top-scoring Hits markers show zero RPKMs?

Family Count Hits TotMarkerLength
gb AAB58447_1 ARO_3002662 APH(9)-Ia__Legionella_pneumophila_130b_
gb AHA41505_1 ARO_3002930 vanRO__Rhodococcus_hoagii_
gb AAD51345_1 ARO_3003052 smeB__Stenotrophomonas_maltophilia_
gb AAF86691_1 ARO_3001816 ACC-2__Hafnia_alvei_
gb CAC14596_1 ARO_3003057 smeF__Stenotrophomonas_maltophilia_
Family Count Hits TotMarkerLength
gb AAB58447_1 ARO_3002662 APH(9)-Ia__Legionella_pneumophila_130b_
gb AHA41505_1 ARO_3002930 vanRO__Rhodococcus_hoagii_
gb AAD51345_1 ARO_3003052 smeB__Stenotrophomonas_maltophilia_
gb AAF86691_1 ARO_3001816 ACC-2__Hafnia_alvei_
gb CAC14596_1 ARO_3003057 smeF__Stenotrophomonas_maltophilia_

That didn’t work; here are the counts|hits.

Count Hits
0 127
14.68017047 63
0.192048279 17
0 15
2.095746364 13

RPKMs are good for both within- and between-sample comparison. The “K” part provides gene-length normalization, which allows you to compare between genes of different lengths (e.g. in the same sample). The “M” part provides read-depth normalization, which lets you compare between samples of different read depths.

The “Hits” column is the integer count of reads that hit one of the markers for a particular family. If a family has multiple markers, but <50% of markers recruited reads, then ShortBRED doesn’t consider the family to be confidently detected and reports Count = 0.

Thanks for this thread. Is there any way to change the parameter of <50% of marker recruited reads so that the count report is reflective of hits? Also, is there a downside to changing this parameter if you can? I don’t understand how if you have some hits, that it would be listed as not present? Is this a false positive?

Also, one other question. Would having 250 base pair (miseq) reads affect the number of hits?

Thank you for your time.

John