I’m using HUMAnN 3.8
I made a custom HUMAnN protein database using UniRef50 mappings. My command was the following:
INPUT=veba_output/preprocess/S4/output/joined.fastq.gz
OUTPUT=test_output
DMND_DB_DIR=veba_output/misc/diamond_database/
NUM_THREADS=1
humann --input ${INPUT} --output ${OUTPUT} --protein-database ${DMND_DB_DIR} --threads ${NUM_THREADS} --bypass-nucleotide-search --input-format fastq.gz --translated-identity-threshold 50 --translated-query-coverage-threshold 80 --search-mode uniref50 --id-mapping  veba_output/misc/humann_uniref_annotations.tsv
Everything ran as expected however, I’m not sure why the output is missing taxonomy when each protein contains the taxonomy info.
(base) [jespinoz@login02 TestVEBA]$ head veba_output/misc/humann_uniref_annotations.tsv
S1__NODE_166_length_33212_cov_10.244534_32964:33209(+)	UniRef50_Q41093	82	c__Bacillariophyceae;o__Naviculales;f__Phaeodactylaceae;g__Phaeodactylum;s__Phaeodactylum tricornutum [CCAP 1055/1];t__S1__METABAT2__E.1__bin.1
S1__NODE_229_length_28347_cov_10.467447_5060:5751(-)	UniRef50_Q41093	196	c__Bacillariophyceae;o__Naviculales;f__Phaeodactylaceae;g__Phaeodactylum;s__Phaeodactylum tricornutum [CCAP 1055/1];t__S1__METABAT2__E.1__bin.1
S1__NODE_298_length_24477_cov_10.374539_11070:11663(+)	UniRef50_Q41093	198	c__Bacillariophyceae;o__Naviculales;f__Phaeodactylaceae;g__Phaeodactylum;s__Phaeodactylum tricornutum [CCAP 1055/1];t__S1__METABAT2__E.1__bin.1
S1__NODE_56_length_52126_cov_10.433043_49957:50562(+)	UniRef50_Q41093	202	c__Bacillariophyceae;o__Naviculales;f__Phaeodactylaceae;g__Phaeodactylum;s__Phaeodactylum tricornutum [CCAP 1055/1];t__S1__METABAT2__E.1__bin.1
S1__NODE_56_length_52126_cov_10.433043_50725:52123(-)	UniRef50_Q41093	338	c__Bacillariophyceae;o__Naviculales;f__Phaeodactylaceae;g__Phaeodactylum;s__Phaeodactylum tricornutum [CCAP 1055/1];t__S1__METABAT2__E.1__bin.1
S1__NODE_743_length_9370_cov_9.617391_5:550(+)	UniRef50_Q41093	182	c__Bacillariophyceae;o__Naviculales;f__Phaeodactylaceae;g__Phaeodactylum;s__Phaeodactylum tricornutum [CCAP 1055/1];t__S1__METABAT2__E.1__bin.1
S1__NODE_14_length_79184_cov_10.324875_17207:17686(+)	UniRef50_A0A1Z5JXG7	160	c__Bacillariophyceae;o__Naviculales;f__Phaeodactylaceae;g__Phaeodactylum;s__Phaeodactylum tricornutum [CCAP 1055/1];t__S1__METABAT2__E.1__bin.1
S1__NODE_646_length_11435_cov_10.486731_5428:5952(-)	UniRef50_A0A1Z5JXG7	175	c__Bacillariophyceae;o__Naviculales;f__Phaeodactylaceae;g__Phaeodactylum;s__Phaeodactylum tricornutum [CCAP 1055/1];t__S1__METABAT2__E.1__bin.1
S1__NODE_674_length_10805_cov_9.362977_3661:4200(-)	UniRef50_A0A1Z5JXG7	180	c__Bacillariophyceae;o__Naviculales;f__Phaeodactylaceae;g__Phaeodactylum;s__Phaeodactylum tricornutum [CCAP 1055/1];t__S1__METABAT2__E.1__bin.1
S1__NODE_8_length_86326_cov_10.138192_57293:57772(+)	UniRef50_A0A1Z5JXG7	160	c__Bacillariophyceae;o__Naviculales;f__Phaeodactylaceae;g__Phaeodactylum;s__Phaeodactylum tricornutum [CCAP 1055/1];t__S1__METABAT2__E.1__bin.1
Check it out:
(base) [jespinoz@login02 TestVEBA]$ head -n 20 test_output/joined_pathabundance.tsv
# Pathway	joined_Abundance
UNMAPPED	280562.1981557097
UNINTEGRATED	503525.8301515482
UNINTEGRATED|c__Bacillariophyceae;o__Bacillariales;f__Bacillariaceae;g__;s__;t__S4__METABAT2__E.1__bin.2	557698.9739954222
UNINTEGRATED|c__Bacillariophyceae;o__Naviculales;f__Phaeodactylaceae;g__Phaeodactylum;s__Phaeodactylum tricornutum [CCAP 1055/1];t__S1__METABAT2__E.1__bin.1	22571.9607414017
UNINTEGRATED|c__Bacillariophyceae;o__Naviculales;f__Phaeodactylaceae;g__Phaeodactylum;s__Phaeodactylum tricornutum [CCAP 1055/1];t__S2__METABAT2__E.1__bin.2	22562.8012692604
UNINTEGRATED|d__Bacteria;p__Pseudomonadota;c__Alphaproteobacteria;o__Rhodobacterales;f__Rhodobacteraceae;g__Aliishimia;s__;t__S3__METABAT2__P.1__bin.4	9215.3696142461
UNINTEGRATED|d__Bacteria;p__Pseudomonadota;c__Alphaproteobacteria;o__Rhodobacterales;f__Rhodobacteraceae;g__JALEFY01;s__;t__S3__MAXBIN2-107__P.1__bin.005	820.2588789283
LIPASYN-PWY: phospholipases	240.5774367950
LIPASYN-PWY: phospholipases|c__Bacillariophyceae;o__Bacillariales;f__Bacillariaceae;g__;s__;t__S4__METABAT2__E.1__bin.2	196.3624498706
LIPASYN-PWY: phospholipases|c__Bacillariophyceae;o__Naviculales;f__Phaeodactylaceae;g__Phaeodactylum;s__Phaeodactylum tricornutum [CCAP 1055/1];t__S1__METABAT2__E.1__bin.1	22.1074934622
LIPASYN-PWY: phospholipases|c__Bacillariophyceae;o__Naviculales;f__Phaeodactylaceae;g__Phaeodactylum;s__Phaeodactylum tricornutum [CCAP 1055/1];t__S2__METABAT2__E.1__bin.2	22.1074934622
PANTO-PWY: phosphopantothenate biosynthesis I	137.1807208031
PWY-7208: superpathway of pyrimidine nucleobases salvage	133.8841119258
PWY-5667: CDP-diacylglycerol biosynthesis I	111.6144003857
PWY0-1319: CDP-diacylglycerol biosynthesis II	111.6144003857
P41-PWY: pyruvate fermentation to acetate and (S)-lactate I	102.5586344296
PWY-5100: pyruvate fermentation to acetate and lactate II	102.5586344296
PWY-5100: pyruvate fermentation to acetate and lactate II|c__Bacillariophyceae;o__Bacillariales;f__Bacillariaceae;g__;s__;t__S4__METABAT2__E.1__bin.2	83.8098433960
SER-GLYSYN-PWY: superpathway of L-serine and glycine biosynthesis I	97.7258146256
LIPASYN-PWY: phospholipases makes sense because it’s literally just the sum of the other LIPASYN-PWY: phospholipases that have taxonomy.
However, this one doesn’t have any taxonomy anywhere in the file:
PANTO-PWY: phosphopantothenate biosynthesis I	137.1807208031
Then there is this one:
(base) [jespinoz@login02 TestVEBA]$ cat test_output/joined_pathabundance.tsv | grep "PWY-3781"
PWY-3781: aerobic respiration I (cytochrome c)	36.1384257683
PWY-3781: aerobic respiration I (cytochrome c)|d__Bacteria;p__Pseudomonadota;c__Alphaproteobacteria;o__Rhodobacterales;f__Rhodobacteraceae;g__Aliishimia;s__;t__S3__METABAT2__P.1__bin.4	17.5489594619
PWY-3781: aerobic respiration I (cytochrome c)|d__Bacteria;p__Pseudomonadota;c__Alphaproteobacteria;o__Rhodobacterales;f__Rhodobacteraceae;g__JALEFY01;s__;t__S3__MAXBIN2-107__P.1__bin.005	2.3236692546
Where the ones with taxonomy don’t sum up to the one without taxonomy.
My main question is why some don’t have taxonomy assigned to them when all of the protein identifiers are linked to a specific taxa in the --id-mapping table.