HUMAnN pathway coverage: community total ≪ species coverage. Why can this happen?

In a single HUMAnN pathcoverage.tsv, is it ever expected that the community (unstrat) coverage for a pathway is smaller than a species-stratified coverage for the same pathway (beyond tiny rounding)?

Example (PWY-7237, same sample):

# Pathway ERR2508876_Coverage

PWY-7237: ... 0.0000000523 # community / unstrat

PWY-7237: ...|g__Anaerostipes.s__A._hadrus 0.9901258490 # species

Coverage is a 0–1 presence/completeness measure. If any species has coverage ≈ 1.0, I would expect the community coverage (built from all species) to be that value, not ≪ it.

  1. Is there any legitimate situation where species coverage > community coverage in the same table and sample (other than tiny rounding)?
  2. Does biobakery_workflows ever renormalize coverage during summary steps, or could this indicate post-processing that should be avoided?
  3. Recommended fix: should we regenerate “clean” coverage from a known-good genefamilies.tsv (e.g., humann --input <…> --input-format genefamilies), and avoid any renorm on coverage?
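For anyone who wants to check how widespread this is in their own table, here is a minimal sketch that scans a coverage table and lists every stratified row whose value exceeds the community row for the same pathway. It assumes the usual HUMAnN layout (tab-separated, stratified rows written as `pathway|taxon`); the function name `flag_inversions` and the `tolerance` parameter are made up for illustration and are not part of HUMAnN.

```python
import csv
import io
from collections import defaultdict


def flag_inversions(tsv_text, tolerance=1e-6):
    """Return (pathway, taxon, community, stratified) tuples for every
    stratified coverage that exceeds its community coverage by more
    than `tolerance`. Hypothetical helper, not a HUMAnN function."""
    community = {}
    stratified = defaultdict(list)
    for row in csv.reader(io.StringIO(tsv_text), delimiter="\t"):
        # Skip blank lines, the "# Pathway" header, and malformed rows.
        if len(row) < 2 or row[0].startswith("#"):
            continue
        feature, value = row[0], float(row[1])
        if "|" in feature:
            pathway, taxon = feature.split("|", 1)
            stratified[pathway].append((taxon, value))
        else:
            community[feature] = value

    flagged = []
    for pathway, rows in stratified.items():
        total = community.get(pathway)
        if total is None:
            continue  # no community row to compare against
        for taxon, value in rows:
            if value - total > tolerance:
                flagged.append((pathway, taxon, total, value))
    return flagged


# The two rows from the example above (pathway name left elided as posted).
demo = (
    "# Pathway\tERR2508876_Coverage\n"
    "PWY-7237: ...\t0.0000000523\n"
    "PWY-7237: ...|g__Anaerostipes.s__A._hadrus\t0.9901258490\n"
)
flagged = flag_inversions(demo)
for pathway, taxon, total, value in flagged:
    print(f"{pathway}\t{taxon}\tcommunity={total:.10f}\tstratified={value:.10f}")
```

To run it on a real file, pass the file contents, e.g. `flag_inversions(open("pathcoverage.tsv").read())`.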

To help answer (1), I am not a dev, but as far as I understand, community coverage should only very rarely (never) be greater than species coverage, since a species is a part of the community.
Consider a community of 10 members and a gene like DNA polymerase, of which every species has one copy. For each species, DNA pol coverage can be up to 1, but for the community the coverage may be 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, or 1.0, depending on how many members recruit reads to their DNA pol.
The actual calculations of HUMAnN3 are much more complex (outlined here https://groups.google.com/g/humann-users/c/oaD4BEZWz1o), but I think the logic still applies.
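The 10-member example above can be written out as a tiny sketch. This is only my toy model (each detected member gets coverage 1, and community coverage is the fraction of members detected); it is not how HUMAnN computes coverage, and `toy_coverages` is a made-up name.

```python
def toy_coverages(detected):
    """detected: one bool per community member, True if that member's
    DNA pol recruited reads. Returns (per-species coverages, community
    coverage) under the toy model described above."""
    species = [1.0 if hit else 0.0 for hit in detected]
    community = sum(species) / len(species)
    return species, community


# Only one of the ten members recruits reads to its DNA pol.
detected = [True] + [False] * 9
species, community = toy_coverages(detected)
print(max(species), community)  # prints: 1.0 0.1
```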

The above link may also help answer (2).

Notably, the coverage calculation was directly ported from HUMAnN 1, where it was developed with community totals in mind. We’ve applied it to species as well in HUMAnN 2 and 3 for consistency, but doing so can produce unintuitive results like this one. (Coverage has been on the chopping block for a while and is officially retired in HUMAnN 4.)

As context, coverage is telling you if the reactions involved in a particular pathway were among the most confidently detected (when ranking reactions by abundance). So a low coverage at the community level suggests that the pathway might be driven by one or more unusually lowly abundant reactions, which are at greater risk of being false positives. Hence, in the HUMAnN 1 days, filtering on coverage provided a way to weed out potential false positive pathway reconstructions. But in HUMAnN 2+ we have greatly improved our specificity for individual reaction quantification. That, combined with the greater context provided by species-stratified results in HUMAnN 2+, superseded the need for a separate coverage calculation IMO.

Back to your example: Because the reactions within a particular species are expected to be roughly evenly covered (modulo unevenness of read sampling and copy number variation), this notion of “lower abundance reaction = less confident detection” starts to break down. In this case, HUMAnN is saying that the reactions in this pathway were unusually lowly abundant at the community level (hence a low coverage value for the community), but not atypical compared to other reactions in A. hadrus (hence a high coverage for A. hadrus).
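To make the "relative to what?" point concrete, here is a toy sketch. It is explicitly not HUMAnN's formula (the thread linked above outlines the real calculation): I am assuming a made-up score in which each pathway reaction is compared to the median abundance of all reactions in the reference set, capped at 1, and the pathway takes the score of its weakest reaction. All reaction names and abundances are invented.

```python
from statistics import median


def toy_coverage(pathway, abundances):
    """Toy score, NOT HUMAnN's calculation: weakest pathway reaction
    relative to the median abundance of the reference set, capped at 1."""
    ref_median = median(abundances.values())
    return min(min(1.0, abundances[r] / ref_median) for r in pathway)


pathway = ["R1", "R2", "R3"]

# Within the species, all reactions sit at a similar abundance.
species_abundances = {"R1": 2.0, "R2": 2.2, "R3": 1.9, "R4": 2.1, "R5": 2.0}

# In the community, the same pathway reactions are dwarfed by reactions
# contributed by much more abundant members (R4 to R9 here).
community_abundances = {
    "R1": 2.0, "R2": 2.2, "R3": 1.9,
    "R4": 210.0, "R5": 190.0, "R6": 200.0,
    "R7": 205.0, "R8": 198.0, "R9": 202.0,
}

species_cov = toy_coverage(pathway, species_abundances)
community_cov = toy_coverage(pathway, community_abundances)
print(f"species: {species_cov:.3f}  community: {community_cov:.3f}")
```

With these invented numbers the pathway looks typical within the species (score near 1) but unusually low against the community-wide distribution (score near 0), which is the same pattern as in the example table.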