Connect Pathways to Uniref90 Gene Families IDs to Gene Families

Dear bioBakery forum, hello there!

I decided to create a new topic to share a clean, detailed report of my experience — hoping it might help other researchers navigating the same issue.


:bullseye: Aim

I was looking for a way to connect Pathways to UniRef90 Gene Families IDs, and UniRef90 Gene Families IDs to Gene Families, with the ultimate goal of linking Pathways directly to Gene Families.


:warning: Issue

Although HUMAnN provides mapping files to link UniRef90 gene families IDs to gene families (e.g., map_$db_uniref90.txt files found in the utility_mapping folder, downloadable via:

humann_databases --download utility_mapping full $DIR

), I noticed there was no file that directly links UniRef90 Gene Families IDs to Pathways.

I found a 2019 post suggesting the use of the humann2_unpack_pathways command. Since I am using HUMAnN v3.9, I looked for the updated tool and found humann_unpack_pathways. I ran it as follows:

humann_unpack_pathways \
  --input-genes humann_genefamilies.tsv \
  --input-pathways humann_pathabundance.tsv \
  --remove-taxonomy \
  --output humann_unpacked.tsv

:light_bulb: Hint

As noted in its tool description, humann_unpack_pathways only outputs pathways for which UniRef90 Gene Families IDs are present in your humann_genefamilies.tsv. This means some pathways may be missing if no corresponding UniRef90 Gene Families IDs were detected.


:books: Related Work

A similar question was raised in this 2021 forum post, but as of today, the referenced files are no longer available and the shared links are broken.


:white_check_mark: Workaround

I developed a method to link UniRef90 Gene Families IDs to pathways, and even trace back individual gene families by passing through UniRef90 Gene Families IDs. This fulfilled my aim of linking Pathways → UniRef90 Gene Families IDs → Gene Families.


:receipt: What You’ll Need

  1. humann_unpacked.tsv (from humann_unpack_pathways)

  2. A list of pathways you’re interested in
    → e.g., humann_pathabundance_stratified.tsv, or a custom subset (I used significant_pathab.rds)

  3. The following utility_mapping files:

    • map_ko_uniref90.txt
    • map_go_uniref90.txt
    • map_pfam_uniref90.txt
    • map_eggnog_uniref90.txt
    • map_level4ec_uniref90.txt

:light_bulb: Note: There’s no mapping file for MetaCyc reactions directly. However, because they share EC numbers, you can repurpose map_level4ec_uniref90.txt to map MetaCyc reactions as well.


:repeat_button: The Function

Here’s the R function I wrote to perform the full linkage and output CSV files connecting pathways, UniRef90 gene families IDs, and functional gene family annotations.

# Load required libraries
library(dplyr)
library(readr)
library(stringr)
library(tidyr)

link_pathways_to_genefamilies <- function(
    base_dir,
    sig_pathab_rds = "results/significant_pathab.rds",
    mapping_names = c("KO", "GO", "Pfam", "EggNOG", "Level4EC"),
    mapping_prefix = "map_",
    uniref_suffix = "_uniref90.txt"
) {
  humann_unpacked_file <- file.path(base_dir, "humann_unpacked.tsv")
  map_dir <- file.path(base_dir, "utility_mapping")
  results_folder <- file.path(base_dir, "results")
  sig_pathab_rds <- file.path(base_dir, sig_pathab_rds)
  
  if (!file.exists(sig_pathab_rds)) stop("Significant pathway RDS not found: ", sig_pathab_rds)
  load(sig_pathab_rds)  # should load object `sp_pathab`
  if (!exists("sp_pathab")) stop("Object `sp_pathab` not found.")
  
  sig_pathways <- sp_pathab %>%
    mutate(PathwayID = str_extract(GeneFamily, "^[^:]+")) %>%
    distinct(PathwayID)
  
  if (!file.exists(humann_unpacked_file)) stop("HUMAnN unpacked file not found.")
  humann_unpacked <- read_tsv(
    humann_unpacked_file,
    col_types = cols_only(`# Pathway` = col_character()),
    col_names = TRUE
  ) %>% rename(PathwayRaw = `# Pathway`)
  
  pathway_gene_pairs <- humann_unpacked %>%
    mutate(
      PathwayID = str_extract(PathwayRaw, "^[^:|]+"),
      GeneFamily = if_else(
        str_detect(PathwayRaw, "\\|"),
        str_split(PathwayRaw, "\\|", simplify = TRUE)[, 2],
        NA_character_
      )
    ) %>%
    filter(PathwayID %in% sig_pathways$PathwayID) %>%
    filter(!is.na(GeneFamily)) %>%
    select(PathwayID, GeneFamily)
  
  write_csv(pathway_gene_pairs, file.path(results_folder, "significant_pathway_gene_pairs.csv"))
  
  read_mapping <- function(path) {
    gf <- unique(pathway_gene_pairs$GeneFamily)
    read_tsv(path, col_names = FALSE, col_types = cols(.default = "c")) %>%
      pivot_longer(cols = -X1, names_to = NULL, values_to = "GeneFamily") %>%
      filter(GeneFamily %in% gf) %>%
      select(FunctionID = X1, GeneFamily) %>%
      filter(!is.na(GeneFamily))
  }

  for (name in mapping_names) {
    map_file <- file.path(map_dir, paste0(mapping_prefix, tolower(name), uniref_suffix))
    if (!file.exists(map_file)) {
      warning("Mapping file missing: ", map_file)
      next
    }

    map_long <- read_mapping(map_file)
    joined <- pathway_gene_pairs %>%
      inner_join(map_long, by = "GeneFamily")

    out_file <- file.path(results_folder, sprintf("significant_pathways_genefamily_%s_links2.csv", tolower(name)))
    write_csv(joined, out_file)

    cat(sprintf("✔ %s: %d UniRef90 gene family ID–%s links saved to %s\n",
                name, nrow(joined), name, basename(out_file)))

    rm(map_long, joined)
    gc()
  }

  cat("Summary:\n")
  cat("  • Significant pathways:", nrow(sig_pathways), "\n")
  cat("  • Pathway–UniRef90 gene family ID pairs:", nrow(pathway_gene_pairs), "\n")
}

# Example usage
link_pathways_to_genefamilies(
  base_dir = "$your_results_folder"
)

Hope this helps — and happy science to everyone!

– Chiara

@franzosa I kindly request your intervention under this thread, to make sure the procedure I adopted is correct.

In fact, I am currently facing an discreancy that creates an issue in my analysis: what happens is that after using humann_unpack_pathways, using the output to connect Gene Families to Pathways, I have 27 Pathways of interest and only 5 of them are backed up by Gene Families.

Does it seems right on you end?

Thanks in advance

I haven’t reviewed the entirely workflow here, but - if you’re working with HUMAnN 3.x - the UniRef90 to reaction map is broader than the UniRef90 to EC map, as we were also taking direct associations between UniRef90s and RXNs provided by MetaCyc. Maybe that’s the source of your discrepancy?

In general, a HUMAnN run uses two files for pathway quantification, a map from genes to reactions (“map1”) and a (structured) map from reactions to pathways (“map2”). These should be listed in your log file. You should be able to take the pathways from your pathway abundance profile, look up their reactions in map2, look up the contributing genes from map1, and then subset to those genes detected from your sample. This is essentially what HUMAnN is doing (in reverse) to go from genes → reactions → pathways.