I’m using HUMAnN4-alpha with Chocophlan_vOct22_202403 and everything’s working fine.
I just have a question about normalization.
HUMAnN4a’s default output is adjusted CPM, but can you tell me about the logic that calculates “adjusted” CPM from RPK?
I couldn’t find it in the tutorial or forum, so I’m asking.
If you’ve come across anything, please let me know.
Also, do you know how to calculate adjusted RPK?
I’m also trying to apply MicrobeCensus to my data so I can use RPKG, but I didn’t consider this when running HUMAnN4a, so I only have the default output: the adjusted CPM table.
Is there a script or method to convert the output from adjusted CPM to RPK?
It’d take too long to run HUMAnN4a again because there’s too much data…
I am always grateful for the contributions of the biobakery team and forum members.
Kirby
I found the count-normalization logic of HUMAnN4a in the source code (store.py).
In my understanding:
RPKs: reads normalized by gene length / 1000
Adjusted RPKs: reads normalized by (abs(gene length − read length) + 1) / 1000
Adjusted CPMs: adjusted RPKs sum-normalized so the total equals 1 million
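To check my own reading, here is a minimal sketch of that arithmetic (illustrative only, not HUMAnN4a’s actual code; the function and variable names are mine):

```python
# Sketch of the three normalizations as I understand them from store.py.
# Not HUMAnN4a's real implementation; names are hypothetical.

def rpk(reads, gene_length_nt):
    """Reads per kilobase: read count over gene length in kb."""
    return reads / (gene_length_nt / 1000.0)

def adjusted_rpk(reads, gene_length_nt, read_length_nt):
    """RPK using the effective (alignable) length: |gene - read| + 1 nt."""
    effective_kb = (abs(gene_length_nt - read_length_nt) + 1) / 1000.0
    return reads / effective_kb

def adjusted_cpm(adjusted_rpks):
    """Sum-normalize a {gene: adjusted RPK} table to one million total."""
    total = sum(adjusted_rpks.values())
    return {gene: v * 1e6 / total for gene, v in adjusted_rpks.items()}
```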
Please give advice if I’m wrong.
However, one question still remains: “Is there a method to convert adjusted CPMs (or adjusted RPKs) back to RPKs?”
Additionally, I’m wondering which is more reasonable conceptually: applying the RPKG transformation after the read-length adjustment, before it, or using the RPKG transformation alone?
I’d guess RPKG after adjustment would be the most reasonable, but that’s just a guess.
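To make the options concrete, here is a hypothetical sketch of the RPKG step itself, where genome equivalents come from MicrobeCensus (total bases sequenced divided by the estimated average genome size); the names are mine, not from either tool:

```python
# Hypothetical sketch of RPKG on top of an (adjusted) RPK value.
# MicrobeCensus estimates average genome size; genome equivalents
# = total bases sequenced / average genome size.

def genome_equivalents(total_bases_sequenced, avg_genome_size_bp):
    """Number of genome equivalents in the sequencing run."""
    return total_bases_sequenced / avg_genome_size_bp

def rpkg(rpk_value, genome_equiv):
    """Reads per kilobase per genome equivalent."""
    return rpk_value / genome_equiv
```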
I’d be so grateful for any advice!
You found the right piece in the code. RPK is “reads / gene length in kilobases.” Adjusted RPK accounts for the fact that reads can’t align to the full gene length (since some would fall off the end of the gene). Hence adjusted RPK uses a modified length based on the read length. (Adjusted) CPM then just sum-normalizes adjusted RPKs to 1 million.
There is no way to back-calculate from CPMs to RPKs because the information about the original sum is lost. If you have your HUMAnN intermediate files saved, I believe you can rerun in --resume mode and select another normalization option (without having to do all the mapping over again).
If you just need something for genome-size normalization (GSN), then CPMs should still work. The idea with GSN is to normalize against an abundance X that you think represents the abundance of universal single-copy genes, often computed as an average over such individual genes. It shouldn’t matter what units X is in.
For example, let’s say that a gene Y has abundance 0.1 CPM. You compute X (as defined above) as 2.0 CPM. Then this suggests that Y is found in 1/20 genomes in the community (i.e. 0.1 CPM / 2.0 CPM).
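That worked example as a tiny sketch (the function name is illustrative; X is the mean abundance of universal single-copy genes, as defined above):

```python
# Sketch of genome-size normalization (GSN) using CPM values,
# following the worked example above. Units cancel in the ratio,
# so CPM works as well as any other unit.

def fraction_of_genomes(gene_cpm, usco_mean_cpm):
    """Estimated fraction of community genomes carrying the gene:
    its abundance over the mean abundance of universal
    single-copy genes."""
    return gene_cpm / usco_mean_cpm

# Gene Y at 0.1 CPM vs. single-copy-gene mean X = 2.0 CPM
# -> Y is found in roughly 1 in 20 genomes in the community.
```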