Why are some HUMAnN features in [name]_pathabundance.tsv missing taxonomy?

I’m using HUMAnN 3.8

I made a custom HUMAnN protein database using UniRef50 mappings. My command was the following:

INPUT=veba_output/preprocess/S4/output/joined.fastq.gz
OUTPUT=test_output
DMND_DB_DIR=veba_output/misc/diamond_database/
NUM_THREADS=1

humann --input ${INPUT} --output ${OUTPUT} --protein-database ${DMND_DB_DIR} --threads ${NUM_THREADS} --bypass-nucleotide-search --input-format fastq.gz --translated-identity-threshold 50 --translated-query-coverage-threshold 80 --search-mode uniref50 --id-mapping  veba_output/misc/humann_uniref_annotations.tsv

Everything ran as expected however, I’m not sure why the output is missing taxonomy when each protein contains the taxonomy info.

(base) [jespinoz@login02 TestVEBA]$ head veba_output/misc/humann_uniref_annotations.tsv
S1__NODE_166_length_33212_cov_10.244534_32964:33209(+)	UniRef50_Q41093	82	c__Bacillariophyceae;o__Naviculales;f__Phaeodactylaceae;g__Phaeodactylum;s__Phaeodactylum tricornutum [CCAP 1055/1];t__S1__METABAT2__E.1__bin.1
S1__NODE_229_length_28347_cov_10.467447_5060:5751(-)	UniRef50_Q41093	196	c__Bacillariophyceae;o__Naviculales;f__Phaeodactylaceae;g__Phaeodactylum;s__Phaeodactylum tricornutum [CCAP 1055/1];t__S1__METABAT2__E.1__bin.1
S1__NODE_298_length_24477_cov_10.374539_11070:11663(+)	UniRef50_Q41093	198	c__Bacillariophyceae;o__Naviculales;f__Phaeodactylaceae;g__Phaeodactylum;s__Phaeodactylum tricornutum [CCAP 1055/1];t__S1__METABAT2__E.1__bin.1
S1__NODE_56_length_52126_cov_10.433043_49957:50562(+)	UniRef50_Q41093	202	c__Bacillariophyceae;o__Naviculales;f__Phaeodactylaceae;g__Phaeodactylum;s__Phaeodactylum tricornutum [CCAP 1055/1];t__S1__METABAT2__E.1__bin.1
S1__NODE_56_length_52126_cov_10.433043_50725:52123(-)	UniRef50_Q41093	338	c__Bacillariophyceae;o__Naviculales;f__Phaeodactylaceae;g__Phaeodactylum;s__Phaeodactylum tricornutum [CCAP 1055/1];t__S1__METABAT2__E.1__bin.1
S1__NODE_743_length_9370_cov_9.617391_5:550(+)	UniRef50_Q41093	182	c__Bacillariophyceae;o__Naviculales;f__Phaeodactylaceae;g__Phaeodactylum;s__Phaeodactylum tricornutum [CCAP 1055/1];t__S1__METABAT2__E.1__bin.1
S1__NODE_14_length_79184_cov_10.324875_17207:17686(+)	UniRef50_A0A1Z5JXG7	160	c__Bacillariophyceae;o__Naviculales;f__Phaeodactylaceae;g__Phaeodactylum;s__Phaeodactylum tricornutum [CCAP 1055/1];t__S1__METABAT2__E.1__bin.1
S1__NODE_646_length_11435_cov_10.486731_5428:5952(-)	UniRef50_A0A1Z5JXG7	175	c__Bacillariophyceae;o__Naviculales;f__Phaeodactylaceae;g__Phaeodactylum;s__Phaeodactylum tricornutum [CCAP 1055/1];t__S1__METABAT2__E.1__bin.1
S1__NODE_674_length_10805_cov_9.362977_3661:4200(-)	UniRef50_A0A1Z5JXG7	180	c__Bacillariophyceae;o__Naviculales;f__Phaeodactylaceae;g__Phaeodactylum;s__Phaeodactylum tricornutum [CCAP 1055/1];t__S1__METABAT2__E.1__bin.1
S1__NODE_8_length_86326_cov_10.138192_57293:57772(+)	UniRef50_A0A1Z5JXG7	160	c__Bacillariophyceae;o__Naviculales;f__Phaeodactylaceae;g__Phaeodactylum;s__Phaeodactylum tricornutum [CCAP 1055/1];t__S1__METABAT2__E.1__bin.1

Check it out:

(base) [jespinoz@login02 TestVEBA]$ head -n 20 test_output/joined_pathabundance.tsv
# Pathway	joined_Abundance
UNMAPPED	280562.1981557097
UNINTEGRATED	503525.8301515482
UNINTEGRATED|c__Bacillariophyceae;o__Bacillariales;f__Bacillariaceae;g__;s__;t__S4__METABAT2__E.1__bin.2	557698.9739954222
UNINTEGRATED|c__Bacillariophyceae;o__Naviculales;f__Phaeodactylaceae;g__Phaeodactylum;s__Phaeodactylum tricornutum [CCAP 1055/1];t__S1__METABAT2__E.1__bin.1	22571.9607414017
UNINTEGRATED|c__Bacillariophyceae;o__Naviculales;f__Phaeodactylaceae;g__Phaeodactylum;s__Phaeodactylum tricornutum [CCAP 1055/1];t__S2__METABAT2__E.1__bin.2	22562.8012692604
UNINTEGRATED|d__Bacteria;p__Pseudomonadota;c__Alphaproteobacteria;o__Rhodobacterales;f__Rhodobacteraceae;g__Aliishimia;s__;t__S3__METABAT2__P.1__bin.4	9215.3696142461
UNINTEGRATED|d__Bacteria;p__Pseudomonadota;c__Alphaproteobacteria;o__Rhodobacterales;f__Rhodobacteraceae;g__JALEFY01;s__;t__S3__MAXBIN2-107__P.1__bin.005	820.2588789283
LIPASYN-PWY: phospholipases	240.5774367950
LIPASYN-PWY: phospholipases|c__Bacillariophyceae;o__Bacillariales;f__Bacillariaceae;g__;s__;t__S4__METABAT2__E.1__bin.2	196.3624498706
LIPASYN-PWY: phospholipases|c__Bacillariophyceae;o__Naviculales;f__Phaeodactylaceae;g__Phaeodactylum;s__Phaeodactylum tricornutum [CCAP 1055/1];t__S1__METABAT2__E.1__bin.1	22.1074934622
LIPASYN-PWY: phospholipases|c__Bacillariophyceae;o__Naviculales;f__Phaeodactylaceae;g__Phaeodactylum;s__Phaeodactylum tricornutum [CCAP 1055/1];t__S2__METABAT2__E.1__bin.2	22.1074934622
PANTO-PWY: phosphopantothenate biosynthesis I	137.1807208031
PWY-7208: superpathway of pyrimidine nucleobases salvage	133.8841119258
PWY-5667: CDP-diacylglycerol biosynthesis I	111.6144003857
PWY0-1319: CDP-diacylglycerol biosynthesis II	111.6144003857
P41-PWY: pyruvate fermentation to acetate and (S)-lactate I	102.5586344296
PWY-5100: pyruvate fermentation to acetate and lactate II	102.5586344296
PWY-5100: pyruvate fermentation to acetate and lactate II|c__Bacillariophyceae;o__Bacillariales;f__Bacillariaceae;g__;s__;t__S4__METABAT2__E.1__bin.2	83.8098433960
SER-GLYSYN-PWY: superpathway of L-serine and glycine biosynthesis I	97.7258146256

LIPASYN-PWY: phospholipases makes sense because it’s literally just the sum of the other LIPASYN-PWY: phospholipases that have taxonomy.

However, this one doesn’t have any taxonomy anywhere in the file:

PANTO-PWY: phosphopantothenate biosynthesis I	137.1807208031

Then there is this one:

(base) [jespinoz@login02 TestVEBA]$ cat test_output/joined_pathabundance.tsv | grep "PWY-3781"
PWY-3781: aerobic respiration I (cytochrome c)	36.1384257683
PWY-3781: aerobic respiration I (cytochrome c)|d__Bacteria;p__Pseudomonadota;c__Alphaproteobacteria;o__Rhodobacterales;f__Rhodobacteraceae;g__Aliishimia;s__;t__S3__METABAT2__P.1__bin.4	17.5489594619
PWY-3781: aerobic respiration I (cytochrome c)|d__Bacteria;p__Pseudomonadota;c__Alphaproteobacteria;o__Rhodobacterales;f__Rhodobacteraceae;g__JALEFY01;s__;t__S3__MAXBIN2-107__P.1__bin.005	2.3236692546

Where the ones with taxonomy don’t sum up to the one without taxonomy.

My main question is why some don’t have taxonomy assigned to them when all of the protein identifiers are linked to a specific taxa in the --id-mapping table.

Because we report the abundance of complete pathways, it’s possible/common for the community total to be larger than the sum of complete contributions from individual species. Consider a tiny pathway A → B. Species 1 contributes 2As and 1B, so it has one complete pathway copy. Species 2 contributes 1A and 2Bs, so it also has one complete copy. But at the community level there are three complete copies (not just two).

In the extreme case, no single species has a complete copy of the pathway, but the pathway is still satisfied at the community level (e.g. species 1 contributes only A and species 2 contributes only B). In this case, you will see a community (unstratified) abundance for the pathway but not species-level stratifications.

These phenomena are a consequence of the way that pathways are reconstructed compared to other functional units. For any other function profiled by HUMAnN (e.g. an EC), the strata will sum to the total.

1 Like