Not detecting known pathways, even in synthetic high depth data from a single species

Hi,

Thanks for your work on developing HUMAnN3, it’s a very useful tool. I’ve been having a problem that unfortunately has persisted to the current version related to detection of pathways known to present in a given sample. For example, I cannot detect the reductive acetyl-CoA synthesis pathway (Wood-Ljungdahl) in any sample I’ve looked at to date.

Databases: full Chocophlan database (full_chocophlan.v296_201901), Uniref90 full (uniref90_annotated_v201901)
HUMAaN: biobakery/humann:3.0.0.a.4.plus.utility.dbs
Metaphlan: 3.0

What I expect: if provided wgsim-generated synthetic data derived from an acetogen (https://www.ncbi.nlm.nih.gov/assembly/GCF_000013105.1, Moorella thermoacetica), I expect to detect CODH-PWY (https://metacyc.org/META/NEW-IMAGE?type=PATHWAY&object=CODH-PWY) or another equivalent pathway

What happens: CODH-PWY is not detected.

Other information: I checked the metacyc pathway information included with HUMAaN3 and noticed this pathway/species association in the metacyc_pathways_to_organisms file:

CODH-PWY Arabidopsis thaliana,Carthamus tinctorius,Glycine max,Limnanthes douglasii,Pisum sativum,Ricinus communis,Saccharomyces cerevisiae

Since I don’t know the intent of the file, I’m not sure if this is an issue, but these species do not match those listed on Metacyc: Acetitomaculum ruminis, Acetobacterium carbinolicum, Acetobacterium woodii, Blautia producta, Clostridium formicaceticum, Eubacterium limosum, Moorella thermoacetica, Moorella thermoautotrophica, Sporomusa malonica, Sporomusa termitida, Syntrophococcus sucromutans, [Butyribacterium] methylotrophicum

Let me know if you need any additional information to help debug this issue, or if there are settings I need to change/databases I need to obtain to get my expected result.

Log file if it helps:

humann_program.txt (48.3 KB)

Any insight into this problem? I unfortunately haven’t had time to look deeper.

Hello, Thank you for the detailed post and including the log. I doubled checked and also see the pathway you mentioned in the database file used by HUMAnN v3. This pathway would only be present in the output file if the associated reactions (and their respective unirefs) are identified from the alignments. It looks like from the log your run ran through the alignments very quickly (in less then 10 seconds for bowtie2) and even though you are using 28 threads this is likely very fast for the average input file. Is it possible you are running with a sub-sampled test file? If so this could result in less alignments and less pathways identified.

The pathways to organisms file is not used for the pathways calculations. My guess is that the difference in the Metacyc version is the reason the species differ for that pathway from those that are currently listed on the Metacyc site.

Thank you,
Lauren

Thanks for the reply Lauren. I’m re-ran the analysis using wgsim-generated synthetic data (~5M reads) from Moorella thermoacetica, one of the canonical Wood-Ljungdahl pathway using acetogens referenced on Metacyc. No CODH-PWY or similar detected, so I’m pretty puzzled at this moment.

Is there any HUMAaN3 debug information that could help me understand why the pathway isn’t being called?

Hi, Thanks for the additional information. I see the species you mentioned is the only one picked up for the custom ChocoPhlAn database creation which explains why that portion of the workflow run is so quick. If you want to dig into the pathway of interest a bit more you could look into the gene families associated with the pathway of interest to see which are included and which are not currently being identified.

The pathway equation is:

CODH-PWY	 FORMATE-DEHYDROGENASE-NADP+-RXN ( ( ( ( METHYLENETHFDEHYDROG-NADP-RXN ( 1.5.1.20-RXN , RXN-5061 ) METHCOCLTH-RXN ) , METHENYLTHFCYCLOHYDRO-RXN , 1.2.7.4-RXN ) ACETYLSYNCLTH-RXN ) , FORMATETHFLIG-RXN ) 

This file contains the reaction to gene family mappings: metacyc_reactions_level4ec_only.uniref.bz2.
The format of the file is tab-delimited columns of reaction, EC, followed by the associated unirefs (gene families).

Also looking at the log you posted ~5k nucleotide alignments are filtered based on filtered genes (genes without enough query coverage) and ~15k translated alignments are filtered based on small query coverage (alignments where only a small portion of the read aligns). You might consider looking at the error rate and quality score settings for the synthetic data generation to see if there is anything you might change as these settings will likely impact the number and quality of the alignments.

Thank you,
Lauren

Thank you! I’ll take a look to see what I find.