Not detecting known pathways, even in synthetic high depth data from a single species

jdwinkler_lanzatech · August 28, 2020, 8:29pm

Hi,

Thanks for your work on developing HUMAnN3, it’s a very useful tool. I’ve been having a problem that unfortunately has persisted to the current version related to detection of pathways known to present in a given sample. For example, I cannot detect the reductive acetyl-CoA synthesis pathway (Wood-Ljungdahl) in any sample I’ve looked at to date.

Databases: full Chocophlan database (full_chocophlan.v296_201901), Uniref90 full (uniref90_annotated_v201901)
HUMAaN: biobakery/humann:3.0.0.a.4.plus.utility.dbs
Metaphlan: 3.0

What I expect: if provided wgsim-generated synthetic data derived from an acetogen (https://www.ncbi.nlm.nih.gov/assembly/GCF_000013105.1, Moorella thermoacetica), I expect to detect CODH-PWY (https://metacyc.org/META/NEW-IMAGE?type=PATHWAY&object=CODH-PWY) or another equivalent pathway

What happens: CODH-PWY is not detected.

Other information: I checked the metacyc pathway information included with HUMAaN3 and noticed this pathway/species association in the metacyc_pathways_to_organisms file:

CODH-PWY Arabidopsis thaliana,Carthamus tinctorius,Glycine max,Limnanthes douglasii,Pisum sativum,Ricinus communis,Saccharomyces cerevisiae

Since I don’t know the intent of the file, I’m not sure if this is an issue, but these species do not match those listed on Metacyc: Acetitomaculum ruminis, Acetobacterium carbinolicum, Acetobacterium woodii, Blautia producta, Clostridium formicaceticum, Eubacterium limosum, Moorella thermoacetica, Moorella thermoautotrophica, Sporomusa malonica, Sporomusa termitida, Syntrophococcus sucromutans, [Butyribacterium] methylotrophicum

Let me know if you need any additional information to help debug this issue, or if there are settings I need to change/databases I need to obtain to get my expected result.

jdwinkler_lanzatech · August 28, 2020, 9:21pm

Log file if it helps:

humann_program.txt (48.3 KB)

jdwinkler_lanzatech · September 8, 2020, 8:51pm

Any insight into this problem? I unfortunately haven’t had time to look deeper.

lauren.j.mciver · September 9, 2020, 7:20pm

Hello, Thank you for the detailed post and including the log. I doubled checked and also see the pathway you mentioned in the database file used by HUMAnN v3. This pathway would only be present in the output file if the associated reactions (and their respective unirefs) are identified from the alignments. It looks like from the log your run ran through the alignments very quickly (in less then 10 seconds for bowtie2) and even though you are using 28 threads this is likely very fast for the average input file. Is it possible you are running with a sub-sampled test file? If so this could result in less alignments and less pathways identified.

The pathways to organisms file is not used for the pathways calculations. My guess is that the difference in the Metacyc version is the reason the species differ for that pathway from those that are currently listed on the Metacyc site.

Thank you,
Lauren

jdwinkler_lanzatech · September 9, 2020, 8:15pm

Thanks for the reply Lauren. I’m re-ran the analysis using wgsim-generated synthetic data (~5M reads) from Moorella thermoacetica, one of the canonical Wood-Ljungdahl pathway using acetogens referenced on Metacyc. No CODH-PWY or similar detected, so I’m pretty puzzled at this moment.

Is there any HUMAaN3 debug information that could help me understand why the pathway isn’t being called?

lauren.j.mciver · September 10, 2020, 7:39pm

Hi, Thanks for the additional information. I see the species you mentioned is the only one picked up for the custom ChocoPhlAn database creation which explains why that portion of the workflow run is so quick. If you want to dig into the pathway of interest a bit more you could look into the gene families associated with the pathway of interest to see which are included and which are not currently being identified.

The pathway equation is:

CODH-PWY	 FORMATE-DEHYDROGENASE-NADP+-RXN ( ( ( ( METHYLENETHFDEHYDROG-NADP-RXN ( 1.5.1.20-RXN , RXN-5061 ) METHCOCLTH-RXN ) , METHENYLTHFCYCLOHYDRO-RXN , 1.2.7.4-RXN ) ACETYLSYNCLTH-RXN ) , FORMATETHFLIG-RXN )

This file contains the reaction to gene family mappings: metacyc_reactions_level4ec_only.uniref.bz2.
The format of the file is tab-delimited columns of reaction, EC, followed by the associated unirefs (gene families).

Also looking at the log you posted ~5k nucleotide alignments are filtered based on filtered genes (genes without enough query coverage) and ~15k translated alignments are filtered based on small query coverage (alignments where only a small portion of the read aligns). You might consider looking at the error rate and quality score settings for the synthetic data generation to see if there is anything you might change as these settings will likely impact the number and quality of the alignments.

Thank you,
Lauren

jdwinkler_lanzatech · September 17, 2020, 10:23pm

Thank you! I’ll take a look to see what I find.

Topic		Replies	Views
Mapping MetaCyc Pathways HUMAnN	5	817	December 15, 2022
Identified engineered pathway from human gut microbiome data HUMAnN	1	296	April 14, 2023
Detecting high abundance of mammalian MetaCyc pathways HUMAnN	4	646	January 9, 2020
No species identified and empty pathway files created HUMAnN	7	644	August 2, 2024
Pathways in Result But Not in MetaCyc Database HUMAnN	2	391	June 9, 2021

Not detecting known pathways, even in synthetic high depth data from a single species

Related topics