No pathways identified after HUMAnN

Hi,

I have shotgun metagenomic data and ran HUMAnN on it. After tweaking the available protein and nucleotide databases, I was able to get the “Gene families and their abundance” table, which shows a lot of identified genes (the command is given at the end). However, the “Pathways and their abundance” table does not show any pathways (screenshot attached).

My questions:

  1. Is it normal to have only gene families and no pathways? Should I fix something?
  2. Is it reasonable to take only gene families and go through with the downstream analyses?
  3. Is it possible to use the gene families (GO terms and their abundances) to construct the pathways with some other tool?

code for humann:

humann --input '/data/dnb09/galaxy_db/files/3/8/b/dataset_38b82fd0-9435-4483-9b72-870fb08df232.dat' --input-format fasta -o 'output' --bypass-prescreen  --nucleotide-database '/data/db/data_managers/humann/data/nucleotide_database/chocophlan-full-3.6.0-29032023' --nucleotide-identity-threshold 0.0 --nucleotide-subject-coverage-threshold 50.0 --nucleotide-query-coverage-threshold 90.0   --translated-alignment 'diamond' --protein-database '/data/db/data_managers/humann/data/protein_database/uniref-uniref90_diamond-3.0.0-13052021' --search-mode 'uniref90' --evalue 1.0 --translated-subject-coverage-threshold 50.0 --translated-query-coverage-threshold 90.0  --gap-fill 'on' --minpath 'on' --pathways 'metacyc' --xipe 'off' --annotation-gene-index 3 --log-level 'DEBUG' --o-log '/data/jwd05e/main/067/779/67779456/outputs/dataset_f849d124-f8cf-4bd9-93af-440aa2e0f1f7.dat' --output-basename 'humann' --output-format 'tsv' --output-max-decimals 10   --threads "${GALAXY_SLOTS:-4}" --memory-use minimum

Thanks,
Hussnain

Are you using HUMAnN’s default databases?

It is surprising to see no pathways, that is true. What environment is the community derived from? What sort of sequencing depth are you working with? Even if the gene output is non-empty, are there a lot of genes present (100Ks) or something smaller?

Thanks for the response,

I am using the UniRef90 protein database and MetaCyc for computing pathways. I also tried UniPathway, but it did not yield any pathways either.

I am working with Illumina shotgun sequencing data from cattle uterus. I don’t know what you mean by sequencing depth or how I can find that out.

Yes, I get a lot of genes in the gene families table (screenshot attached), and they can be renamed using Gene Ontology (which I am using right now). I am just trying to do enrichment based on the GO terms output and perhaps construct pathways from it.

By sequencing depth I meant the number of sequencing reads in a single sample. For shotgun metagenomes it is usually on the order of 10s of millions. Is that what you’re working with here?
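For anyone else unsure how to check this, a minimal Python sketch for counting the reads in one sample file is shown here. It assumes the file is either FASTA or four-line FASTQ (no wrapped sequence lines), optionally gzip-compressed; the function names `open_text` and `count_reads` are made up for this example and are not part of HUMAnN.

```python
import gzip
from pathlib import Path


def open_text(path):
    """Open a plain or gzip-compressed file in text mode."""
    path = Path(path)
    if path.suffix == ".gz":
        return gzip.open(path, "rt")
    return open(path, "rt")


def count_reads(path):
    """Count reads in a FASTA or four-line FASTQ file.

    Assumption: FASTQ records are exactly four lines each
    (header, sequence, '+', qualities).
    """
    with open_text(path) as handle:
        first = handle.readline()
        if not first:
            return 0  # empty file
        if first.startswith(">"):
            # FASTA: one header line per read
            return 1 + sum(1 for line in handle if line.startswith(">"))
        if first.startswith("@"):
            # FASTQ: total number of lines divided by four
            n_lines = 1 + sum(1 for _ in handle)
            return n_lines // 4
        raise ValueError("File does not look like FASTA or FASTQ")
```

Calling `count_reads("sample_R1.fastq.gz")` (a hypothetical file name) on the forward reads of one sample gives the number to compare against the "10s of millions" figure above.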

If I understand your screenshot, you are seeing ~4K genes, which is not a lot in the grand scheme of things (each species usually has a few K). It seems like you’re just not recovering a lot of gene diversity, either because of low read depth or because the sample isn’t well represented in our database. You can potentially try switching to UniRef50 to improve the latter problem.
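To get the gene count without reading it off a screenshot, something like the following sketch could be used. It assumes the usual HUMAnN gene families layout: a tab-separated table with a `#` header line, community-level rows, and per-species rows marked with `|` in the feature name. The function name `count_gene_families` and the set of special row names it skips are choices made for this example.

```python
# Rows that are bookkeeping categories rather than real gene families.
# Assumption: these are the special row names present in the table.
SPECIAL_ROWS = {"UNMAPPED", "UNGROUPED", "UniRef90_unknown", "UniRef50_unknown"}


def count_gene_families(lines):
    """Count distinct community-level features in a gene families table.

    `lines` is any iterable of text lines, for example an open file handle.
    """
    families = set()
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines and the header
        feature = line.split("\t", 1)[0]
        if "|" in feature:
            continue  # skip per-species (stratified) rows
        if feature in SPECIAL_ROWS:
            continue
        families.add(feature)
    return len(families)
```

For a real table this would be called as `count_gene_families(open("humann_genefamilies.tsv"))`, where the file name is only a guess based on the `--output-basename 'humann'` setting in the command above.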

Thanks Eric,

I just checked the depth of a few of my samples, and I have roughly 9-12 million sequencing reads per sample (forward reads). I will also try UniRef50 (2023) now.

That should be plenty of depth. If the issue is remoteness of homology, UniRef50 will help. You can also confirm that the reads have been properly QC’ed. Sometimes technical sequences left on the reads can prevent mapping.
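As a rough way to spot leftover technical sequence, the sketch below reports the fraction of reads that contain a given adapter string. The default `AGATCGGAAGAGC` is the common start of standard Illumina adapters; whether it applies to a particular library prep is an assumption, and the function name `adapter_fraction` is invented for this example. A dedicated QC tool will do a more thorough job; this is only a quick sanity check.

```python
# Common start of standard Illumina adapters.
# Assumption: the library was prepared with a kit that uses this adapter.
ILLUMINA_ADAPTER_PREFIX = "AGATCGGAAGAGC"


def adapter_fraction(sequences, adapter=ILLUMINA_ADAPTER_PREFIX):
    """Return the fraction of sequences containing the adapter string.

    `sequences` is any iterable of read sequences (strings).
    Matching is exact and case-insensitive; mismatches are not tolerated.
    """
    adapter = adapter.upper()
    total = 0
    hits = 0
    for seq in sequences:
        total += 1
        if adapter in seq.upper():
            hits += 1
    return hits / total if total else 0.0
```

A value close to zero on the QC'ed reads is what one would hope to see; a noticeable fraction would suggest that trimming did not remove everything.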

Thank you. I tried UniRef50, and now I have fewer gene families but do find a few pathways. The screenshot summary for both is attached here. I also checked the QC of my sample after removing host sequences, and it does not seem to have any problems (summary attached).

I am working on the EU server, where I have problems with the ChocoPhlAn database. So, all the output I am sharing was produced with the --bypass-prescreen option.

Thanks again for your help. And apologies if my questions are too stupid.