Normalization with lots of unmapped humann reads


Hello, I have sort of a theoretical question regarding normalization of my data. I'm running humann3 with a custom database containing protein sequences from only a few key enzymes of interest. As a result, the vast majority of the reads in my metagenomic samples remain unmapped, which is fine for my purposes. Typically I would re-normalize the RPKs to relative abundance and run them in maaslin2. However, since the vast majority of reads are unmapped (>0.99 relative abundance for most samples), I have instead been leaving the humann results in RPK, dropping the UNMAPPED row, and then applying CLR normalization before the maaslin2 analysis. I'm wondering if anyone has input on whether this is statistically sound, or if there is something I'm not considering. Thank you for your help!

If you want to talk about the relative composition among the small number of enzymes of interest across samples, then ignoring the unmapped fraction and computing relative abundance / CPMs / CLR values over the enzymes’ RPKs should all get the job done.
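To make that concrete, here's a minimal sketch of both transformations on a single sample's RPK vector (the RPK values are made up for illustration; the UNMAPPED row is assumed to already be dropped):

```python
import numpy as np

# Hypothetical RPK values for three enzymes of interest in one sample.
rpk = np.array([120.0, 30.0, 50.0])

# Relative abundance: each enzyme's share of the mapped signal.
relab = rpk / rpk.sum()

# CLR: log of each value minus the mean log (i.e., log of the value
# divided by the geometric mean). A pseudocount would be needed first
# if any RPK were zero.
log_rpk = np.log(rpk)
clr = log_rpk - log_rpk.mean()

print(relab)  # sums to 1 by construction
print(clr)    # sums to ~0 by construction
```

Note that both transforms only describe composition *within* the enzyme set: multiplying all RPKs in a sample by a constant leaves the output unchanged, which is exactly why the total-abundance signal discussed below is lost.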

If you’re interested at all in the total abundance of these enzymes (as a fraction of the total sequencing depth), then you lose that signal with the above approaches. If you want to retain this signal, which is often the case when working with functions of interest rather than broad functional profiles, some approaches I’ve used in the past are 1) computing RPKMs using the total reads (not just the mapped reads) for the M in the RPKM calculation or 2) doing some flavor of genome-size normalization (e.g. comparing the enzymes’ RPKs to the RPKs of housekeeping genes). MUSiCC and MicrobeCensus are options for #2.
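Approach #1 is just a one-line rescaling; here's a sketch (the RPK value and read count are hypothetical):

```python
def rpk_to_rpkm(rpk, total_reads):
    """Convert RPK to RPKM, using TOTAL reads (mapped + unmapped) as M.

    Because M includes the unmapped fraction, samples where the enzymes
    make up a larger share of the sequencing depth get larger RPKMs,
    preserving the total-abundance signal.
    """
    return rpk * 1e6 / total_reads

# e.g. an enzyme at 120 RPK in a sample with 20 million total reads:
print(rpk_to_rpkm(120.0, 20_000_000))  # → 6.0
```

By contrast, dividing by only the mapped reads would cancel out sample-to-sample differences in the mapped fraction, which is precisely the signal you're trying to keep.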

Here’s an example of work where we adjusted the abundances of some enzymes of interest according to the average genome size of HMP samples: