It is written in the HUMAnN 2.0 User Manual:
“An alignment score is based on the number of matches to the reference gene for a specific sequence. It is divided by the length of the reference gene in kilobases to normalize for gene length.”
After reading this, I wondered why the information about read length is not included in the normalization, but only information on the reference gene length is. Then, I found one comment about RPKMs:
“RPKMs are good for both within- and between-sample comparison. The “K” part provides gene-length normalization, which allows you to compare between genes of different lengths (e.g. in the same sample).”
However, because not all reads have the same length even in the same sample, this potentially includes a bias in your normalization.
I will try to make my point clear:
From my perspective, it is important to normalize the gene counts because longer genes tend to be over-represented, as they have are more frequently split among several reads. Short reads are less affected by this.
However, it is possible to think about a scenario when it is clear dividing the gene count by its lenght introduces a new bias:
If a hypothetical technology can sequence the entire genome using a single read, the gene counting will be correct from the start. However, after dividing by the length of the reference gene, the longer the genes are, the more sub-represented they will be.
Another approach would be to divide the gene counts by an adjustment parameter that take into account the read length. This parameter would be calculated by answering the question: On average, into how many reads of size Y, a gene of size X would be split?
For example, if “gene length” = 101, and “read length” = 100, the gene will always be split in 2 reads, and counted twice. Therfore, adjustmentParameter = 2.
If “gene length” = 100, and “read length” = 100, the adjustmentParameter = 1.99, because only in 1 case out of 100, the gene would not be split in 2 reads.
A mathematical formula that answers the question “In average, in how many reads of size Y, a gene of size X would be splited into?” can be written as:
normalizedCount = geneCount/adjustmentParameter
adjustmentParameter = 1 + flor(L/d) + (L - flor(L/d)*d -1)/d
L = gene length
d = read length
However, the read length is not constant across the sample, which is not taken into account in the mentioned formula. So, it would be necessary to have a different count of genes for each read length. Then, apply the normalization formula, and sum up all partial counts. But is probably much easear to just increase the gene count by 1/adjustmentParameter, for each new gene hit.
I’m still a new Humman2 user. Please forgive me if I failed to understand RPKM, and how your normalization works.