HUMAnN v4.0.0.alpha.1 Raw Counts

Hello,

I’m currently looking at the output of HUMAnN v4.0.0.alpha.1 run with the --count-normalization Counts flag so I could run the output with DESeq2. When I inspected the genefamilies, I noticed out of ~250,000 values, around 6,000 were decimals. Is this expected from the Counts output? If so, what is the best way to handle these for a counts-based program?
Similarly, the metaphlan_profile did not contain counts, only estiamted_number_of_reads_from_the_clade. Does this act as an appropriate stand in? Any thoughts appreciated.

Not answering your question, but when using counts for reference sequences of various lengths (e.g., different genes from same taxon, same gene from different taxon, both of which occur often in humann’s databases), consider trying to account for the sequence length variation. Does 10 reads aligned to a 1000 bp gene mean the same thing as 10 reads aligned to a 5000 bp gene?

It may be that the devs did consider that a “count” was reads normalized by gene length (N reads / X gene length). In a sense, this would be comparable to, say 16S sequencing, where the reference sequence length is (relatively) uniform, and so there needs no normalization for read counts to be comparable. It may be that these could serve for DESeq2’s “counts”.

Based on what I understand of HUMANn3, the abundance quantification is complex (for example, note the name “gene families” instead of more simply “gene”), and so it may not always use read counts directly. Can you tell if the decimals are among certain strata, or primarily in the unclassified fractions, for which there is a different algorithm for read alignment (see humann2 docs and Have the basic work flow changed between HuManN version 2 and 3? )?

Note that DESeq2 was developed ( Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2 | Genome Biology | Springer Nature Link ) for differential expression (gene in conditon 1 vs same gene in condition 2) of RNASeq (a transcriptome of a single organism) of humans (ie homo sapiens cells not microbes). When comparing read counts of a gene only to itself, length doesn’t matter. DESeq2 appears to take gene length into account only conditionally (Analyzing RNA-seq data with DESeq2), though their documentation about this subject more generally is rather unclear and focused on the same gene with distinct isoforms (mostly specific to Euks).
I would not consider myself an RNASeq or transcriptomics expert, but this, among other reasons, has always been a weakness of DESeq2 in my opinion, particularly for anything microbiome.

The default abundance units in HUMAnN 4 are CPMs, which are read counts normalized to gene length and sample depth. Previously HUMAnN reported RPKs, which are read counts normalized to gene length but NOT sample depth. If you ask HUMAnN 4 for counts, neither gene-length nor sample-depth normalization are performed. HOWEVER, because HUMAnN will still allow a read’s weight to be divided over multiple target sequences (i.e. in cases of ambiguous mapping), the “counts” are not always integers, especially in the context of translated search / unclassified abundances (where multiple-mapping is more common).

1 Like