How do we normalize hit numbers to the sample?
Is it reasonable to normalize hits to the number of reads in the FASTA file?
The “Count” column in the ShortBRED quantify output is in units of RPKMs, which are already adjusted for sequencing depth (number of reads in the sample). Hence you should be able to analyze those directly.
Thanks very much. I believe RPKM is best for comparing different types of marker within a sample, whereas Hits would work best across samples.
I’m wondering why Counts (RPKM) and Hits give such different results. Why would some of my top-scoring Hits markers show zero RPKMs?
Family | Count | Hits | TotMarkerLength |
---|---|---|---|
gb | AAB58447_1 | ARO_3002662 | APH(9)-Ia__Legionella_pneumophila_130b_ |
gb | AHA41505_1 | ARO_3002930 | vanRO__Rhodococcus_hoagii_ |
gb | AAD51345_1 | ARO_3003052 | smeB__Stenotrophomonas_maltophilia_ |
gb | AAF86691_1 | ARO_3001816 | ACC-2__Hafnia_alvei_ |
gb | CAC14596_1 | ARO_3003057 | smeF__Stenotrophomonas_maltophilia_ |
Family | Count | Hits | TotMarkerLength |
---|---|---|---|
gb | AAB58447_1 | ARO_3002662 | APH(9)-Ia__Legionella_pneumophila_130b_ |
gb | AHA41505_1 | ARO_3002930 | vanRO__Rhodococcus_hoagii_ |
gb | AAD51345_1 | ARO_3003052 | smeB__Stenotrophomonas_maltophilia_ |
gb | AAF86691_1 | ARO_3001816 | ACC-2__Hafnia_alvei_ |
gb | CAC14596_1 | ARO_3003057 | smeF__Stenotrophomonas_maltophilia_ |
That didn’t work; here are the counts|hits.
Count | Hits |
---|---|
0 | 127 |
14.68017047 | 63 |
0.192048279 | 17 |
0 | 15 |
2.095746364 | 13 |
RPKMs are good for both within- and between-sample comparison. The “K” part provides gene-length normalization, which allows you to compare between genes of different lengths (e.g. in the same sample). The “M” part provides read-depth normalization, which lets you compare between samples of different read depths.
The “Hits” column is the integer count of reads that hit one of the markers for a particular family. If a family has multiple markers, but <50% of markers recruited reads, then ShortBRED doesn’t consider the family to be confidently detected and reports Count = 0.
Thanks for this thread. Is there any way to change the parameter of <50% of marker recruited reads so that the count report is reflective of hits? Also, is there a downside to changing this parameter if you can? I don’t understand how if you have some hits, that it would be listed as not present? Is this a false positive?
Also, one other question. Would having 250 base pair (miseq) reads affect the number of hits?
Thank you for your time.
John