Reason for subtracting read length from marker length when computing abundance?

Hello,
As we know, abundance is computed as “read_num / marker_length”.
When MetaPhlAn computes abundance, however, the read count is divided by “|marker_length - avg_read_length|+1” in these stats: avg_g, tavg_g, tavg_l, wavg_l, med.
May I know why marker_length is adjusted this way?

Mainly, I want to know why avg_read_length is subtracted from marker_length.
Is it just a way to adjust the abundance for the randomness of the sequencing experiment?

This was posted under the HUMAnN topic, but I can answer it here because HUMAnN uses the same math.

When normalizing gene abundances you want to adjust for the alignable length of the gene. Since you can’t globally align a 100 nt read starting at the last position of the gene (or 2nd to last, or 10th to last, etc.), we don’t count those positions in the gene’s length when normalizing. The number of positions in a gene from which to start an alignment is (gene length - read length + 1), which we call the “alignable length.”

The absolute value corrects for corner cases where the gene is shorter than the read and so their roles switch in the logic above.
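
To make the arithmetic concrete, here is a minimal Python sketch of the idea. The function names are hypothetical and this is not the actual MetaPhlAn or HUMAnN code; it only assumes the formula quoted in the question, |marker_length - avg_read_length| + 1. The brute-force counter is there to check that the formula matches the number of valid alignment start positions, including the case where the gene is shorter than the read.

```python
def alignable_length(gene_length: float, read_length: float) -> float:
    """Number of start positions for a full-length alignment.

    Hypothetical helper: applies |gene_length - read_length| + 1.
    """
    return abs(gene_length - read_length) + 1


def count_start_positions(gene_length: int, read_length: int) -> int:
    """Brute-force check: slide the shorter sequence along the longer one
    and count the offsets at which it fits completely inside."""
    longer = max(gene_length, read_length)
    shorter = min(gene_length, read_length)
    return sum(1 for start in range(longer) if start + shorter <= longer)


def normalized_abundance(read_count: float, gene_length: float,
                         avg_read_length: float) -> float:
    """Read count divided by alignable length instead of raw gene length."""
    return read_count / alignable_length(gene_length, avg_read_length)


# A 1000 nt gene with 100 nt reads has 901 valid start positions.
print(alignable_length(1000, 100), count_start_positions(1000, 100))
# Corner case: an 80 nt gene and 100 nt reads; the roles switch.
print(alignable_length(80, 100), count_start_positions(80, 100))
```

With 100 nt reads, dividing by the raw length of a 1000 nt gene would underestimate its coverage by roughly 10%, and the error grows as the gene gets shorter relative to the read.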

Here is a graphical explanation:

Thank you. I got it!