Gene length normalization

It is written in the HUMAnN 2.0 User Manual:
“An alignment score is based on the number of matches to the reference gene for a specific sequence. It is divided by the length of the reference gene in kilobases to normalize for gene length.”

After reading this, I wondered why the information about read length is not included in the normalization, but only information on the reference gene length is. Then, I found one comment about RPKMs:
“RPKMs are good for both within- and between-sample comparison. The “K” part provides gene-length normalization, which allows you to compare between genes of different lengths (e.g. in the same sample).”

However, because not all reads have the same length even in the same sample, this potentially includes a bias in your normalization.
I will try to make my point clear:
From my perspective, it is important to normalize the gene counts because longer genes tend to be over-represented, as they have are more frequently split among several reads. Short reads are less affected by this.
However, it is possible to think about a scenario when it is clear dividing the gene count by its lenght introduces a new bias:
If a hypothetical technology can sequence the entire genome using a single read, the gene counting will be correct from the start. However, after dividing by the length of the reference gene, the longer the genes are, the more sub-represented they will be.

Another approach would be to divide the gene counts by an adjustment parameter that take into account the read length. This parameter would be calculated by answering the question: On average, into how many reads of size Y, a gene of size X would be split?

For example, if “gene length” = 101, and “read length” = 100, the gene will always be split in 2 reads, and counted twice. Therfore, adjustmentParameter = 2.
If “gene length” = 100, and “read length” = 100, the adjustmentParameter = 1.99, because only in 1 case out of 100, the gene would not be split in 2 reads.

A mathematical formula that answers the question “In average, in how many reads of size Y, a gene of size X would be splited into?” can be written as:

normalizedCount = geneCount/adjustmentParameter
adjustmentParameter = 1 + flor(L/d) + (L - flor(L/d)*d -1)/d
L = gene length
d = read length
flor(X): https://en.wikipedia.org/wiki/Floor_and_ceiling_functions

However, the read length is not constant across the sample, which is not taken into account in the mentioned formula. So, it would be necessary to have a different count of genes for each read length. Then, apply the normalization formula, and sum up all partial counts. But is probably much easear to just increase the gene count by 1/adjustmentParameter, for each new gene hit.

I’m still a new Humman2 user. Please forgive me if I failed to understand RPKM, and how your normalization works.

Kind regards,
Rodrigo.

The manual is giving an abbreviated explanation. We actually normalize on a per-read basis adjusting for both the length of the gene and the length of the alignment (if you require the entire read to align to a gene, then the alignment length is the same as the read length).

More specifically, if a read aligns to a gene of length L via an alignment of length A, then we update the gene’s abundance in RPKs by W * 1000 / (L - A + 1), where W is the weight of the alignment (W = 1 if the alignment is not weighted).

We then recommend sum-normalizing gene-level RPKs to relative abundance or (more conveniently) copies per million units (analogous to the TPM units of RNA-seq). Copies per million units are advantageous over RPKMs in that they don’t depend on the length distribution of genes in the sample. Please see the link below for more details about the RPKM vs TPM difference:

2 Likes

Thank you very much for the more detailed explanation! I understand it better now. :slight_smile: