Hi, I’ve run Humann2 (using default settings and the ChocoPhlAn database) on human gut shotgun metagenomic samples that have been quality-trimmed and dehosted (mapped against hg38), but I still find mammalian pathways with high abundance, e.g.:
COA-PWY-1: coenzyme A biosynthesis II (mammalian)
LYSINE-DEG1-PWY: L-lysine degradation XI (mammalian)
PWY-6309: L-tryptophan degradation XI (mammalian, via kynurenine)
I know that Humann2 is quite conservative, but is it possible these are false positives? Or is there some alternate explanation?
I’d chalk these up to issues with the pathway definition/annotation rather than host contamination. While (e.g.) COA-PWY-1 is annotated as a mammalian pathway in MetaCyc, it has historically been annotated to many species in BioCyc, with these being well-represented genera:
Clostridium 31
Acinetobacter 42
Staphylococcus 50
Enterococcus 57
Streptococcus 228
This can result from one of two issues: 1) the pathway is poorly named or 2) the pathway is very similar to a more broadly distributed pathway (e.g. COA-PWY) and the tolerance built into the annotation system has a hard time assigning one and not the other.
Issue 2 is relevant to HUMAnN2’s pathway reconstruction as well, since the default --gap-fill on mode will report a pathway if a single reaction is missing, which could be a reaction that distinguishes (e.g.) COA-PWY and COA-PWY-1. Turning off gap-filling will be more conservative.
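To make that concrete, here is a toy sketch of how tolerating one missing reaction can blur two near-identical pathways. The reaction IDs and pathway definitions are hypothetical placeholders (not the real COA-PWY or COA-PWY-1 definitions), and the `reported` function is a simplified stand-in for the rule described above, not HUMAnN2's actual code.

```python
def reported(pathway_reactions, detected, gap_fill=True):
    """Toy rule: report a pathway if at most one required reaction is
    missing (gap-fill on) or if none are missing (gap-fill off)."""
    missing = [r for r in pathway_reactions if r not in detected]
    allowed = 1 if gap_fill else 0
    return len(missing) <= allowed

# Hypothetical pathways that differ by a single distinguishing reaction.
broad_pathway = {"R1", "R2", "R3", "R4"}
variant_pathway = {"R1", "R2", "R3", "R5"}

# Hypothetical sample in which only the broad pathway is truly complete.
detected = {"R1", "R2", "R3", "R4"}

for gap_fill in (True, False):
    print(
        "gap-fill on " if gap_fill else "gap-fill off",
        "broad:", reported(broad_pathway, detected, gap_fill),
        "variant:", reported(variant_pathway, detected, gap_fill),
    )
```

In this toy case both pathways are reported with gap-filling on, while only the broad one is reported with it off.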
Thanks Eric, that makes sense!
As for the second issue, could we say that the default --gap-fill on is better for low sequence coverage samples (to avoid false negatives) and that --gap-fill off is better for high sequence coverage samples (to avoid false positives)?
Looking at the definition of gap filling:
Gap filling allows for a single required reaction to have a zero abundance. For all pathways, the required reaction with the lowest abundance is replaced with the abundance of the required reaction with the second lowest abundance.
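As a minimal sketch of that definition (my own reading of it, not HUMAnN2's source code), the replacement step could look like the following. The reaction IDs and abundance values are made up for illustration, and the function name `gap_fill` is hypothetical.

```python
def gap_fill(abundances):
    """Return a copy in which the lowest required-reaction abundance is
    replaced by the second-lowest one, as the quoted definition describes."""
    if len(abundances) < 2:
        return dict(abundances)
    ordered = sorted(abundances, key=abundances.get)
    lowest, second_lowest = ordered[0], ordered[1]
    filled = dict(abundances)
    filled[lowest] = abundances[second_lowest]
    return filled

# Hypothetical pathway with one undetected required reaction.
one_gap = {"RXN-A": 12.0, "RXN-B": 0.0, "RXN-C": 9.5, "RXN-D": 14.0}
print(gap_fill(one_gap))

# With two undetected reactions, one zero remains after filling.
two_gaps = {"RXN-A": 0.0, "RXN-B": 0.0, "RXN-C": 9.5, "RXN-D": 14.0}
print(gap_fill(two_gaps))
```

Under this reading, a single zero is papered over, but a second zero survives the fill, so a pathway with two missing reactions would still look incomplete.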
Doesn’t this bias pathways with fewer reactions to be detected more frequently with --gap-fill on? Assume a pathway has 3 reactions: a single pathway could be missing (33% missing) and it can still be detected. Whereas a more complicated pathway with 30 reactions, two of which are missing (6.7% missing), won’t be detected. You’d think that the former pathway is less complete, thus less likely to be real? Or am I misunderstanding the principle behind metabolic pathways?
There are really two independent layers where the gap filling helps. One of them is what you proposed, i.e. at <1x coverage-breadth we expect to under-sample some genes, which will cause us to miss reactions (and hence pathways) as false negatives.
The other layer is under-annotation. HUMAnN2 relies on pre-existing mappings of genes -> reactions in order to assign reactions (and hence pathways) to species. As a large fraction of genes are unannotated, even if we quantify them, we may under-count pathways in which they participate. Gap-filling helps here as well (and is more akin to the “assign pathways to genomes” issues that BioCyc faces).
On the pathway size issue, you are right: the “fill up to one” reaction rule is more permissive for smaller pathways and more stringent for larger pathways (though I’ll clarify that HUMAnN2 only quantifies MetaCyc pathways with 4+ reactions, so your example ends up being a tad small, but the idea holds).
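A quick back-of-the-envelope sketch of that size effect: if exactly one reaction may be filled, the tolerated fraction of missing reactions is simply 1/n for a pathway with n required reactions. The function name and the pathway sizes below are illustrative choices, not HUMAnN2 parameters.

```python
def tolerated_missing_fraction(n_reactions, max_filled=1):
    """Fraction of required reactions that can be absent while the
    pathway is still reported, assuming a fixed number of fillable gaps."""
    return max_filled / n_reactions

# Illustrative sizes, starting at the 4-reaction minimum mentioned above.
for n in (4, 10, 30):
    print(f"{n:>2} reactions: {tolerated_missing_fraction(n):.1%} may be missing")
```

So the same one-reaction allowance is roughly 25% of a 4-reaction pathway but only about 3% of a 30-reaction pathway.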
Sorry, I should have said “a single reaction could be missing…” But I see you got the point.