Reason for subtracting bases number in abundance?

cquxiaoy · February 8, 2023, 9:32am

Hello,
As we know, abundance is computed as “read_num / marker_length”.
When computing abundance in MetaPhlAn, however, the read count is divided by “|marker_length - avg_read_length|+1” for the stats avg_g, tavg_g, tavg_l, wavg_l, and med.
May I know why marker_length is handled like this?

cquxiaoy · February 10, 2023, 3:29am

I mainly want to know why avg_read_length is subtracted from marker_length.
Is it just an empirical adjustment to the abundance, derived from random experiments?

franzosa · February 10, 2023, 9:34pm

This was posted under the HUMAnN topic, but I can answer it here because HUMAnN uses the same math.

When normalizing gene abundances you want to adjust for the alignable length of the gene. Since you can’t globally align a 100 nt read starting at the last position of the gene (or 2nd to last, or 10th to last, etc.), we don’t count those positions in the gene’s length when normalizing. The number of positions in a gene from which to start an alignment is (gene length - read length + 1), which we call the “alignable length.”

The absolute value corrects for corner cases where the gene is shorter than the read and so their roles switch in the logic above.
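
The counting argument above can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not code from MetaPhlAn or HUMAnN: the function names and the example lengths are hypothetical, and it assumes a single read length (in practice the average read length would be used).

```python
def alignable_length(gene_length: int, read_length: int) -> int:
    """Closed form: |gene_length - read_length| + 1."""
    return abs(gene_length - read_length) + 1


def count_start_positions(gene_length: int, read_length: int) -> int:
    """Brute-force check: slide the shorter sequence along the longer one
    and count the start positions where it still fits entirely."""
    longer = max(gene_length, read_length)
    shorter = min(gene_length, read_length)
    return sum(1 for start in range(longer) if start + shorter <= longer)


def length_normalized_abundance(read_count: int, gene_length: int, read_length: int) -> float:
    """Read count divided by the alignable length (hypothetical helper)."""
    return read_count / alignable_length(gene_length, read_length)


# Usual case: a 1000 nt gene and 100 nt reads leave 901 valid start positions.
print(alignable_length(1000, 100), count_start_positions(1000, 100))

# Corner case: an 80 nt gene is shorter than a 100 nt read, so the roles switch.
print(alignable_length(80, 100), count_start_positions(80, 100))
```

In both cases the closed form and the brute-force count agree, which is the reason for taking the absolute value rather than letting the denominator go to zero or negative for short genes.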

Here is a graphical explanation: [image]

cquxiaoy · February 13, 2023, 1:20am

Thank you. I got it!
