The output includes 2_genefamilies.tsv , and I noticed that the first few lines contain unmapped reads:
Gene Family HUMAnN v4.0.0.alpha.1 Adjusted CPMs sample1
READS_UNMAPPED 302380430.0000000000
However, when I sum up all other gene families (excluding READS_UNMAPPED), their total is approximately 1,000,000 CPM (counts per million). This makes sense because HUMAnN reports in CPM units. But the READS_UNMAPPED value is in absolute counts (~302 million), which is much larger than 1 million.
So my question is: Is it normal for READS_UNMAPPED to be reported in raw read counts while all other features are in CPM? It seems inconsistent, and adding READS_UNMAPPED to the sum of other features does not yield 1 million, which would be expected if all values were in the same unit.
I’ve run this twice and got the same result, so it’s likely not a processing error. Could someone clarify how READS_UNMAPPED is calculated and why it’s not in CPM like the rest of the output?
We made the decisions in HUMAnN 4 to 1) automatically do sum-normalization to CPMs vs. leaving features in RPK units, 2) keep including the UNMAPPED read count in the output (for consistency), but 3) not include this count in the CPM normalization.
We opted to add the automatic normalization since it was preferable to be computing downstream abundances (e.g. pathway abundance) from pre-normalized gene abundances. It was also inconvenient for users that, barring unusual applications, it was almost always the right step to immediately normalize the method’s outputs after they were generated.
The reason for not including the UNMAPPED reads in this normalization is two-fold. 1) Not doing so is more consistent with other areas of shotgun sequencing analysis, where abundances are usually expressed as fractions of MAPPED reads (vs. total reads). 2) Gene abundances are initially computed in RPK units (to adjust for differences in length) before sum normalizing. This is a critical step, but it results in the gene abundances and UNMAPPED abundance being in different units (RPKs vs. reads). Hence, it’s not strictly correct to include them in the same sum during normalization, although doing so is probably not a terrible approximation in practice.
This is something we plan to revisit in future releases. I think a nice alternative might be to compute CPMs over genes, as we do now, but then scale those abundances to fill the % MAPPED reads of the sample (similar to how MetaPhlAn handles % UNKNOWN estimation). This would then place the UNMAPPED read mass and other features in the same composition.