Smoothing DNA-level features to avoid divide-by-zero errors


I’m looking to normalize paired RNA and DNA abundances from meta-samples analyzed with Humann3.

The Humann3 wiki suggests:

For low-abundance species, random sampling may lead to detection of transcripts for undetected genes. In these cases, we recommend smoothing DNA-level features to avoid divide-by-zero errors during normalization.

Are there commands/options to do this smoothing in Humann3, or do I perform this on my own? Perhaps it’s automatically done in the CPM normalization step? When I searched for this I couldn’t find it in the Humann2/3 wikis or tutorials, but I did see that Humann (legacy) included a Witten–Bell smoothed output file.


Sorry for the delayed reply here. This is actually an area where we’ve been doing some research recently to determine the best practices for combined RNA + DNA analysis (which haven’t made their way into HUMAnN yet).

If you’re looking to smooth zero values, my current suggestion is to do it per-feature, replacing zeroes with half the smallest non-zero feature for that measurement (smoothing features per-sample turns out to bake some unwanted signal of sequencing depth into the data). This procedure is also a lot easier to implement than the Witten-Bell method.

Thank you for the info! This is very helpful, I’ll give it a shot! I had found a py script called “humann_rna_dna_norm” in the biobakery3 bin and wondered if that was appropriate to use or not.

In regards to practices for RNA + DNA analysis it’s great to hear it’s something you’re actively working on! I’ve found this particularly troublesome when trying to interpret longitudinal data. For example I have a feature where the DNA levels peak at Time 4 and the RNA levels peak at Time 13. I tried a simple RNA:DNA ratio of the feature’s TPM, and that value peaks at Time 11. So simply saying “the peak abundance of this transcript after normalizing for gene abundance is Time 11” is likely true, but it also ignores the biologically significant interpretation of how disparate the peak DNA and RNA Times are from each other, and that in fact neither the DNA nor the RNA actually peak at Time 11. I suppose the answer is a mixed linear model where Time, DNA, and RNA values are all considered…but I digress.

Anyway, I wonder how the utility of any particular RNA:DNA normalization method differs if its being used for a simple differential expression comparison between two samples vs interpreting longitudinal datasets.

The humann_rna_dna_norm script is a legacy method (I don’t think it’s actually available in the path anymore?) that just computes the RNA/DNA ratio for normalization. That tends to work well on broad, community-level features (e.g. reactions or pathways), which also tend to be more abundant, and less well on species-specific genes, which can be much less abundant (and hence more sensitive to zeros / small counts).

As a preview of upcoming work, we’re starting to prefer using DNA as a covariate for RNA normalization rather than as a divisor. That results in models like…

log( RNA ) ~ log( DNA ) + covariates

Where covariates could include “time” in your case. You can think of such a model as relating samples’ residual RNA abundance (i.e. abundance that can’t be directly explained by gene copy ~ expression level) to the covariates.

This seems to work well across a range of RNA features, and we’ve found that performance gets even better if you restrict yourself to the samples where feature RNA and DNA were both non-zero (rather than smoothing the zeros).