Hey All!
I wanted to ask whether the percentage of mapped reads reported after mapping to a database is in RPK or in raw read counts. I went through the tutorial and found the following line:
"The “UNMAPPED” value is the total number of reads which remain unmapped after both alignment steps (nucleotide and translated search). Since other gene features in the table are quantified in RPK units, “UNMAPPED” can be interpreted as a single unknown gene of length 1 kilobase recruiting all reads that failed to map to known sequences."
Does this mean that when I report that 50% of the reads in my sample mapped to bacteria, that figure is actually in RPK units and not raw reads?
Looking forward to your reply.
Regards
Jigyasa
Before you normalize the genefamilies abundances, the value for “UNMAPPED” is the literal count of reads that HUMAnN didn’t assign to any sequences (note: these reads may have had hits to sequences, but HUMAnN rejected the hits for being low-confidence). To convert that number to a percentage of unmapped reads, you’d need to divide it by the total number of sequencing reads in the sample.
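As a rough illustration of that division, here is a minimal Python sketch. The read counts are made-up numbers, and `percent_unmapped` is a hypothetical helper written for this post, not something HUMAnN provides.

```python
def percent_unmapped(unmapped_reads: float, total_reads: int) -> float:
    """Percent of sequencing reads that were left unmapped."""
    if total_reads <= 0:
        raise ValueError("total_reads must be positive")
    return 100.0 * unmapped_reads / total_reads


# Hypothetical sample: 10 million sequencing reads, of which the
# un-normalized UNMAPPED value reports 5 million.
pct = percent_unmapped(5_000_000, 10_000_000)
print(f"{pct:.1f}% unmapped, {100 - pct:.1f}% mapped")
```

The total read count has to come from the sequencing data itself (for example, from counting the reads in the input file), since it is not a row in the genefamilies table.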
When you normalize the file, that’s when we’re treating that UNMAPPED count as if it were in RPK units (like the other genes in the file) so that it makes sense to compute a sum over the file. If you normalize to relative abundance units, the new value for UNMAPPED would be close to the true % unmapped value described above, but off by a little bit because the RPK values for the genes in the file are not the same as raw read counts. Put another way, the sum over genes’ RPK values is not the same as the total number of reads in the sample.
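To make the "off by a little bit" point concrete, here is a small Python sketch with invented numbers. It assumes a simplified RPK (reads divided by gene length in kilobases); HUMAnN's actual per-gene computation is more involved, so treat this only as an illustration of why the two percentages differ. All names below are hypothetical.

```python
from typing import Sequence


def relab_unmapped(unmapped: float, gene_rpk: Sequence[float]) -> float:
    """Fraction taken by UNMAPPED after sum-normalizing the whole column."""
    return unmapped / (unmapped + sum(gene_rpk))


# Hypothetical sample: 1000 reads, 400 of them unmapped.
total_reads = 1000
unmapped = 400

# The 600 mapped reads are split across three genes of different lengths.
mapped_reads = [300, 200, 100]
gene_len_kb = [2.0, 1.0, 0.5]

# Simplified RPK: reads per kilobase of gene length (an assumption here).
gene_rpk = [r / kb for r, kb in zip(mapped_reads, gene_len_kb)]

true_fraction = unmapped / total_reads        # based on raw read counts
relab = relab_unmapped(unmapped, gene_rpk)    # based on the RPK column sum

print(f"true unmapped fraction:      {true_fraction:.3f}")
print(f"relative-abundance UNMAPPED: {relab:.3f}")
```

The two values differ because the sum of the genes' RPK values (550 in this toy example) is not the number of mapped reads (600), which is the same reason the normalized UNMAPPED value only approximates the true % unmapped.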