Conntecting pathway abundance to contigs

Hello!

I hope you can help me. We have run humann2 on our samples and have gotten out the metacyc pathway information and pathway abundance. For another analysis, we would like to check how the pathways are distributed across the contigs, e.g. Pathway1: first 4 reactions are on contig1, last 3 reactions are on contig2. What would be the best way to do this?

My idea was to parse the metacyc_pathways_structured_filtered file the same way as humann2 does, so that the analyses stay consistent. Basically, I would like to know: from the metacyc_pathways_structured_filtered file, how does humann2 decide which combos of genes constitute a complete pathway?

examples:
METHGLYUT-PWY 1.1.1.283-RXN LACTALDDEHYDROG-RXN -L-LACTDEHYDROGFMN-RXN -RXN0-4281 -RXN-8632 GLYOXIII-RXN GLYOXI-RXN GLYOXII-RXN -DLACTDEHYDROGFAD-RXN
According to the code of the PathwaysDatabase class in store.py, that would be an unstructured pathway, right? So, for the unstructured pathways, does a pathway count as complete if all reactions are there that are not optional?

PWY-7431 ( -SYNEPHRINE-DEHYDRATASE-RXN , -RXN-15198 , -OCTOPAMINE-DEHYDRATASE-RXN , RXN-5821 ) -1.2.1.53-RXN RXN-8505 RXN6666-4 RXN6666-5

PWY-7433 2.4.1.41-RXN 2.4.1.122-RXN ( 2.4.99.4-RXN , ( 2.4.1.102-RXN 2.4.1.146-RXN ) )
For the structured pathways, I am a little more confused. I get how the boolean operators are used, I am just wondering how the optional reactions play into this. For PWY-7433, I would think that the pathway would count as complete if either: 2.4.1.41-RXN + 2.4.1.122-RXN + 2.4.99.4-RXN are present, or 2.4.1.41-RXN + 2.4.1.122-RXN + 2.4.1.102-RXN + 2.4.1.146-RXN are there, is that correct? But then, what does it mean for completeness when the reactions in the parentheses are marked as optional like in PWY 7431?

Thanks in advance!

A clarifying question: You mentioned running contigs through HUMAnN. Did you mean sequencing reads there, or are you actually running contigs (i.e. output from assembly of reads) through HUMAnN?

All of the pathways in the “structured” file are structured in the sense that we expect all their pieces to be satisfied, as opposed to (say) an unstructured gene list like a GO term.

When a pathway doesn’t have any ()s in it, like the METHGLYUT-PWY example, it just means that there are no (A or B) options to optimize over – in principle all of the reactions on that line are needed (though in that case a few reactions have - prefixes to indicate that they are optional).

HUMAnN2’s pathway abundance calculations boil down to finding the “weakest link” in the chain (i.e. reaction of lowest abundance). So any time there is an (A or B) choice, HUMAnN takes the option with higher abundance. Optional reactions are ignored by this optimization such that they aren’t allowed to be the weakest link.

1 Like

Thanks for your reply!
Whoops, I misread something in a report - it was run with the reads of course.

Thanks for answering the question, that does clear up a lot.