Dear bioBakery forum, hello there!
I decided to create a new topic to share a clean, detailed report of my experience — hoping it might help other researchers navigating the same issue.
Aim
I was looking for a way to connect Pathways to UniRef90 Gene Families IDs, and UniRef90 Gene Families IDs to Gene Families, with the ultimate goal of linking Pathways directly to Gene Families.
Issue
Although HUMAnN provides mapping files to link UniRef90 gene families IDs to gene families (e.g., map_$db_uniref90.txt
files found in the utility_mapping
folder, downloadable via:
humann_databases --download utility_mapping full $DIR
), I noticed there was no file that directly links UniRef90 Gene Families IDs to Pathways.
I found a 2019 post suggesting the use of the humann2_unpack_pathways
command. Since I am using HUMAnN v3.9, I looked for the updated tool and found humann_unpack_pathways
. I ran it as follows:
humann_unpack_pathways \
--input-genes humann_genefamilies.tsv \
--input-pathways humann_pathabundance.tsv \
--remove-taxonomy \
--output humann_unpacked.tsv
Hint
As noted in its tool description, humann_unpack_pathways
only outputs pathways for which UniRef90 Gene Families IDs are present in your humann_genefamilies.tsv
. This means some pathways may be missing if no corresponding UniRef90 Gene Families IDs were detected.
Related Work
A similar question was raised in this 2021 forum post, but as of today, the referenced files are no longer available and the shared links are broken.
Workaround
I developed a method to link UniRef90 Gene Families IDs to pathways, and even trace back individual gene families by passing through UniRef90 Gene Families IDs. This fulfilled my aim of linking Pathways → UniRef90 Gene Families IDs → Gene Families.
What You’ll Need
-
humann_unpacked.tsv
(fromhumann_unpack_pathways
) -
A list of pathways you’re interested in
→ e.g.,humann_pathabundance_stratified.tsv
, or a custom subset (I usedsignificant_pathab.rds
) -
The following
utility_mapping
files:map_ko_uniref90.txt
map_go_uniref90.txt
map_pfam_uniref90.txt
map_eggnog_uniref90.txt
map_level4ec_uniref90.txt
Note: There’s no mapping file for MetaCyc reactions directly. However, because they share EC numbers, you can repurpose
map_level4ec_uniref90.txt
to map MetaCyc reactions as well.
The Function
Here’s the R function I wrote to perform the full linkage and output CSV files connecting pathways, UniRef90 gene families IDs, and functional gene family annotations.
# Load required libraries
library(dplyr)
library(readr)
library(stringr)
library(tidyr)
link_pathways_to_genefamilies <- function(
base_dir,
sig_pathab_rds = "results/significant_pathab.rds",
mapping_names = c("KO", "GO", "Pfam", "EggNOG", "Level4EC"),
mapping_prefix = "map_",
uniref_suffix = "_uniref90.txt"
) {
humann_unpacked_file <- file.path(base_dir, "humann_unpacked.tsv")
map_dir <- file.path(base_dir, "utility_mapping")
results_folder <- file.path(base_dir, "results")
sig_pathab_rds <- file.path(base_dir, sig_pathab_rds)
if (!file.exists(sig_pathab_rds)) stop("Significant pathway RDS not found: ", sig_pathab_rds)
load(sig_pathab_rds) # should load object `sp_pathab`
if (!exists("sp_pathab")) stop("Object `sp_pathab` not found.")
sig_pathways <- sp_pathab %>%
mutate(PathwayID = str_extract(GeneFamily, "^[^:]+")) %>%
distinct(PathwayID)
if (!file.exists(humann_unpacked_file)) stop("HUMAnN unpacked file not found.")
humann_unpacked <- read_tsv(
humann_unpacked_file,
col_types = cols_only(`# Pathway` = col_character()),
col_names = TRUE
) %>% rename(PathwayRaw = `# Pathway`)
pathway_gene_pairs <- humann_unpacked %>%
mutate(
PathwayID = str_extract(PathwayRaw, "^[^:|]+"),
GeneFamily = if_else(
str_detect(PathwayRaw, "\\|"),
str_split(PathwayRaw, "\\|", simplify = TRUE)[, 2],
NA_character_
)
) %>%
filter(PathwayID %in% sig_pathways$PathwayID) %>%
filter(!is.na(GeneFamily)) %>%
select(PathwayID, GeneFamily)
write_csv(pathway_gene_pairs, file.path(results_folder, "significant_pathway_gene_pairs.csv"))
read_mapping <- function(path) {
gf <- unique(pathway_gene_pairs$GeneFamily)
read_tsv(path, col_names = FALSE, col_types = cols(.default = "c")) %>%
pivot_longer(cols = -X1, names_to = NULL, values_to = "GeneFamily") %>%
filter(GeneFamily %in% gf) %>%
select(FunctionID = X1, GeneFamily) %>%
filter(!is.na(GeneFamily))
}
for (name in mapping_names) {
map_file <- file.path(map_dir, paste0(mapping_prefix, tolower(name), uniref_suffix))
if (!file.exists(map_file)) {
warning("Mapping file missing: ", map_file)
next
}
map_long <- read_mapping(map_file)
joined <- pathway_gene_pairs %>%
inner_join(map_long, by = "GeneFamily")
out_file <- file.path(results_folder, sprintf("significant_pathways_genefamily_%s_links2.csv", tolower(name)))
write_csv(joined, out_file)
cat(sprintf("✔ %s: %d UniRef90 gene family ID–%s links saved to %s\n",
name, nrow(joined), name, basename(out_file)))
rm(map_long, joined)
gc()
}
cat("Summary:\n")
cat(" • Significant pathways:", nrow(sig_pathways), "\n")
cat(" • Pathway–UniRef90 gene family ID pairs:", nrow(pathway_gene_pairs), "\n")
}
# Example usage
link_pathways_to_genefamilies(
base_dir = "$your_results_folder"
)
Hope this helps — and happy science to everyone!
– Chiara