Single copy gene normalization

In a metagenome, you can calculate the estimated number of copies of a gene per genome (let’s call it CPG). One way to estimate this value is to first calculate, for each sample, the geometric mean of the abundances of a set of universal single-copy genes (let’s call it the SCG geometric mean), and then normalize the abundance of the genes by this SCG geometric mean. I would like to use MaAsLin3 to compare the CPG of my genes between groups. To do this, I would input the abundance table normalized by the SCG geometric mean, with normalization = NONE and median_comparison_abundance = FALSE. Do you think this is a valid use of MaAsLin3?
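For concreteness, the normalization described above could be sketched as follows. This is only a minimal illustration for a single sample: the function names, gene names, and abundance values are hypothetical, and it assumes all single-copy gene abundances are strictly positive (zeros would need separate handling).

```python
import math

def scg_geometric_mean(scg_abundances):
    """Geometric mean of the single-copy gene (SCG) abundances in one sample."""
    if not scg_abundances or any(a <= 0 for a in scg_abundances):
        # Assumption: the geometric mean is only computed on positive values.
        raise ValueError("SCG abundances must be non-empty and strictly positive")
    mean_log = sum(math.log(a) for a in scg_abundances) / len(scg_abundances)
    return math.exp(mean_log)

def copies_per_genome(gene_abundances, scg_abundances):
    """Divide each gene abundance by the sample's SCG geometric mean."""
    gm = scg_geometric_mean(scg_abundances)
    return {gene: abundance / gm for gene, abundance in gene_abundances.items()}

# Hypothetical values for one sample (e.g. coverage of each gene).
scg = [8.0, 10.0, 12.5]                    # geometric mean is 10
genes = {"gene_A": 20.0, "gene_B": 5.0}    # made-up gene names
cpg = copies_per_genome(genes, scg)
```

Here `gene_A` would come out at roughly 2 copies per genome and `gene_B` at roughly 0.5. The per-sample CPG values computed this way would form the table passed to MaAsLin3.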

Hi @danpal,

I’m not particularly familiar with this type of normalization, but it sounds somewhat similar to CLR (without the log and with a different reference frame). CLR worked fine in MaAsLin3, although note that in our testing the MaAsLin3 defaults worked better than CLR.
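To make the comparison concrete, the two transformations can be written side by side. The notation is an assumption for illustration: x_{ij} is the abundance of gene i in sample j, G is the number of features in the table, and S is the set of universal single-copy genes.

```latex
\mathrm{CLR}_{ij} = \log \frac{x_{ij}}{\left( \prod_{k=1}^{G} x_{kj} \right)^{1/G}},
\qquad
\mathrm{CPG}_{ij} = \frac{x_{ij}}{\left( \prod_{s \in S} x_{sj} \right)^{1/|S|}}
```

Both divide by a per-sample geometric mean; the CPG version takes it over the single-copy genes only and skips the log.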

As such, I cannot guarantee whether this normalization will lead to any change in performance. If you plan to do this analysis, I would at the very least look at the diagnostic plots and plot the relationships you’re interested in to make sure they look reasonable. Moreover, I would encourage you to think deeply about your normalization and the biological question you are asking by using it. I think what you described is reasonable, but it’s important to realize that normalizing the data in this way changes the type of question you are asking compared to using something like TSS.

Cheers,
Jacob Nearing


Thanks for the response. In my case, I mainly use it for the normalization of antimicrobial resistance genes. If I have a CPG value of 2, it means that, on average, each bacterium has two copies of that gene. I think this measure is much more interpretable for genes than simply expressing that a gene has a certain relative abundance within the pool of all the genes being considered.
