READS_UNMAPPED Is Not Presented in CPM Format

Hi there,

I’m running HUMAnN v4.0.0.alpha.1 on paired-end metagenomic data using the following commands:

zcat sample1_1.fq.gz sample1_2.fq.gz > sample1.fq.gz  
humann -i sample1.fq.gz --output sample1_out --input-format fastq.gz --threads 12 --remove-temp-output --metaphlan-options " --db_dir mpa_vOct22 -t rel_ab_w_read_stats " --memory-use maximum

The output includes 2_genefamilies.tsv, and I noticed that the first few lines contain unmapped reads:

Gene Family    HUMAnN v4.0.0.alpha.1 Adjusted CPMs    sample1  
READS_UNMAPPED    302380430.0000000000

However, when I sum up all other gene families (excluding READS_UNMAPPED), their total is approximately 1,000,000 CPM (counts per million). This makes sense because HUMAnN reports in CPM units. But the READS_UNMAPPED value is in absolute counts (~302 million), which is much larger than 1 million.

So my question is: Is it normal for READS_UNMAPPED to be reported in raw read counts while all other features are in CPM? It seems inconsistent, and adding READS_UNMAPPED to the sum of other features does not yield 1 million, which would be expected if all values were in the same unit.

I’ve run this twice and got the same result, so it’s likely not a processing error. Could someone clarify how READS_UNMAPPED is calculated and why it’s not in CPM like the rest of the output?

Thank you!

We made the decisions in HUMAnN 4 to 1) automatically do sum-normalization to CPMs vs. leaving features in RPK units, 2) keep including the UNMAPPED read count in the output (for consistency), but 3) not include this count in the CPM normalization.

We opted to add the automatic normalization because it is preferable to compute downstream abundances (e.g. pathway abundance) from pre-normalized gene abundances. It was also inconvenient for users to have to normalize the method’s outputs themselves right after they were generated, which, barring unusual applications, was almost always the right next step.

The reason for not including the UNMAPPED reads in this normalization is two-fold. 1) Not doing so is more consistent with other areas of shotgun sequencing analysis, where abundances are usually expressed as fractions of MAPPED reads (vs. total reads). 2) Gene abundances are initially computed in RPK units (to adjust for differences in length) before sum normalizing. This is a critical step, but it results in the gene abundances and UNMAPPED abundance being in different units (RPKs vs. reads). Hence, it’s not strictly correct to include them in the same sum during normalization, although doing so is probably not a terrible approximation in practice.
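To make the unit split concrete, here is a minimal sketch of the check described in the original question: read a single-sample gene families table, set the unmapped row aside, and sum the remaining community-level features. The table contents are made-up toy values, the helper name `split_unmapped` is hypothetical, and skipping rows that contain `|` assumes stratified rows follow the usual `feature|taxon` naming.

```python
import csv
import io

UNMAPPED_ID = "READS_UNMAPPED"  # row name as shown in the output quoted above


def split_unmapped(tsv_text):
    """Return (unmapped_value, {feature: value}) for a single-sample table.

    Hypothetical helper for illustration; not part of HUMAnN.
    """
    unmapped = None
    features = {}
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t")
    next(reader)  # skip the header line
    for row in reader:
        if len(row) < 2:
            continue  # ignore blank or malformed lines
        name, value = row[0], float(row[1])
        if name == UNMAPPED_ID:
            unmapped = value  # raw read count, not CPM
        elif "|" not in name:
            # Assumption: rows with "|" are per-taxon stratifications that
            # would double count their community-level parent if summed.
            features[name] = value
    return unmapped, features


# Toy table with invented values, only to show the shape of the check.
toy = (
    "Gene Family\tsample1\n"
    "READS_UNMAPPED\t3000.0\n"
    "UniRef90_A\t600000.0\n"
    "UniRef90_A|g__X.s__Y\t600000.0\n"
    "UniRef90_B\t400000.0\n"
)

unmapped, features = split_unmapped(toy)
total_cpm = sum(features.values())
print(f"unmapped reads: {unmapped:.0f}, feature total: {total_cpm:.0f} CPM")
```

Keeping the unmapped value in its own variable, rather than in the same dictionary as the features, makes it harder to fold it into a sum by accident.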

This is something we plan to revisit in future releases. I think a nice alternative might be to compute CPMs over genes, as we do now, but then scale those abundances to fill the % MAPPED reads of the sample (similar to how MetaPhlAn handles % UNKNOWN estimation). This would then place the UNMAPPED read mass and other features in the same composition.
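As a rough sketch of that alternative (one possible reading of the idea, not something HUMAnN currently does), the CPMs could be multiplied by the mapped fraction of the sample, with the remainder of the million assigned to the unmapped mass. The function name `rescale_to_mapped_fraction` and the toy numbers are hypothetical, and how the mapped fraction is obtained is left as an assumption (for example, one minus an unaligned percentage reported elsewhere).

```python
def rescale_to_mapped_fraction(cpm, mapped_fraction):
    """Scale gene CPMs so that features plus unmapped mass total 1e6.

    cpm: dict of feature -> CPM over mapped features only.
    mapped_fraction: assumed fraction of reads that mapped, in [0, 1].
    Hypothetical helper for illustration; not part of HUMAnN.
    """
    if not 0.0 <= mapped_fraction <= 1.0:
        raise ValueError("mapped_fraction must be between 0 and 1")
    total = sum(cpm.values())
    if total <= 0:
        raise ValueError("feature abundances must sum to a positive value")
    # Re-normalize defensively, then shrink to the mapped share of the sample.
    scaled = {
        name: value / total * 1e6 * mapped_fraction
        for name, value in cpm.items()
    }
    # The unmapped mass fills whatever share of the million is left over.
    scaled["READS_UNMAPPED"] = (1.0 - mapped_fraction) * 1e6
    return scaled


# Toy example: 37.5% of reads unaligned, so 62.5% mapped.
toy_cpm = {"UniRef90_A": 600000.0, "UniRef90_B": 400000.0}
rescaled = rescale_to_mapped_fraction(toy_cpm, mapped_fraction=0.625)
for name, value in rescaled.items():
    print(f"{name}\t{value:.1f}")
```

Under this scheme the relative proportions among genes are unchanged; only their shared scale shrinks to make room for the unmapped share.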


I’m glad I found this post! I was really scared because I initially normalized including those UNMAPPED reads, and the abundances of my genes were even lower than I’d expected.

However, I’m a little confused about 1) where the number of raw UNMAPPED reads comes from and 2) how this normalization affects calculations for pathway abundances.

  1. I’m not sure where the number of raw UNMAPPED reads comes from: I searched for this number in the humann log file with Ctrl+F and got no hits. I also found these lines in the log file (not adjacent to each other):
11/20/2025 05:20:30 PM - humann.utilities - DEBUG: b'73413966 reads;
11/20/2025 07:26:44 PM - humann.humann - INFO: Unaligned reads after translated alignment: 37.5111787313 %

I multiplied that percentage by the total number of reads, and the result did not match the UNMAPPED read count.

  2. I’m more interested in this question: how can I interpret pathway abundances knowing that UniRef IDs have been normalized to CPM? Are the pathway abundance sums in the pathabundance file generated from the CPM UniRef abundances or from RPK units? Is the UNMAPPED number in the pathabundance file the raw read count just scaled by the compression constant k?

Thank you for your guidance. I’m trying to figure out how to accurately reflect the amount of unmapped reads for later analysis, where keeping the abundance of unmapped reads might help prevent artificial alterations in housekeeping gene abundance.

Best, Gillian