Understanding humann_infer_taxonomy


I am having trouble understanding what the command-line arguments of the utility script humann_infer_taxonomy do.

I have read through the code of infer_taxonomy.py and the humann_infer_taxonomy documentation, but I still do not understand what --mode {totals,unclassified,stratified} and --lca-choice {source_tax,uniref_lca,humann_lca} control. What do the different options do?

I also wonder whether there is a way to preserve existing species-level taxonomic information from the pangenome search for a given gene family, both when unclassified is replaced by some level (e.g. family) and when unclassified cannot be replaced.
In other words, is it possible to not modify features of known genus/species to match target level, but just to re-assign unclassified taxonomic gene families based on results from translated search – if that makes sense?

Here are the available command line options:

(humann3.6_metaphlan4_py3.9) bernhard@macbook ~ % humann_infer_taxonomy -h
usage: humann_infer_taxonomy [-h] -i INPUT [-o OUTPUT] [-l {Kingdom,Phylum,Class,Order,Family,Genus}] [-d {uniref50-tol-lca,uniref90-tol-lca}] [-m {totals,unclassified,stratified}]
                             [-c {source_tax,uniref_lca,humann_lca}] [-t THRESHOLD] [--devdb DEVDB]

HUMAnN utility for inferring "unclassified" taxonomy
Based on the lowest common ancestor (LCA) annotation
of each UniRef50/90 cluster, infer approximate taxonomy 
for unclassified features at a target level of resolution. 
Will modify features of known genus/species to match 
target level.

optional arguments:
  -h, --help            show this help message and exit
  -i INPUT, --input INPUT
                        HUMAnN genefamilies table
  -o OUTPUT, --output OUTPUT
                        Destination for modified table; default=STDOUT
  -l {Kingdom,Phylum,Class,Order,Family,Genus}, --level {Kingdom,Phylum,Class,Order,Family,Genus}
                        Desired level for taxonomic estimation/summation; default=Family
  -d {uniref50-tol-lca,uniref90-tol-lca}, --database {uniref50-tol-lca,uniref90-tol-lca}
                        UniRef-specific taxonomy database
  -m {totals,unclassified,stratified}, --mode {totals,unclassified,stratified}
                        Which rows to include in the estimation/summation; default=totals
  -c {source_tax,uniref_lca,humann_lca}, --lca-choice {source_tax,uniref_lca,humann_lca}
                        Which per-gene taxonomic annotation to consider; default=humann_lca
  -t THRESHOLD, --threshold THRESHOLD
                        Minimum frequency for a new taxon to be included; default=1e-3
  --devdb DEVDB         Manually specify a development database

I am using humann v3.6 right now.
Best regards,

Allow me to clarify. --mode determines which features the script will act on:

  • “totals” means ignore stratification, treat the total abundance of a feature as coming from one taxon, and try to infer that taxon from the UniRef LCA.

  • “unclassified” means focus just on the unclassified abundance and try to assign it an approximate taxonomy based on the UniRef LCA. Other stratifications are ignored. This is the main way I use the script (i.e. to get a rough sense of which taxa might be the major contributors to the unclassified fraction).

  • “stratified” will bubble up the taxonomy of the species-stratified abundances to a given level while also trying to place the unclassified abundance at that level.
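To make the three modes concrete, here is a toy Python sketch of which rows each one would act on. It is only an illustration of the descriptions above, not the script's actual code: the feature IDs and the helper `rows_for_mode` are made up, and it assumes the usual genefamilies naming where a stratified row looks like `FEATURE|STRATUM`.

```python
# Toy feature IDs in the genefamilies style: "FEATURE" or "FEATURE|STRATUM".
# These IDs are hypothetical examples, not real output.
rows = [
    "UniRef90_A",
    "UniRef90_A|g__Escherichia.s__Escherichia_coli",
    "UniRef90_A|unclassified",
]


def rows_for_mode(rows, mode):
    """Return the rows a given --mode would act on (illustrative only)."""
    if mode == "totals":
        # Ignore stratification: only the per-feature total rows.
        return [r for r in rows if "|" not in r]
    if mode == "unclassified":
        # Only the unclassified stratum; other strata are ignored.
        return [r for r in rows if r.endswith("|unclassified")]
    if mode == "stratified":
        # Species strata (bubbled up) plus the unclassified stratum.
        return [r for r in rows if "|" in r]
    raise ValueError(f"unknown mode: {mode}")


for mode in ("totals", "unclassified", "stratified"):
    print(mode, rows_for_mode(rows, mode))
```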

--lca-choice is something we added in HUMAnN 3 after noticing that UniRef LCAs are very conservative (e.g. a cluster with 100 E. coli proteins and one viral protein will be assigned “root of life” as LCA, although “E. coli” is a more sensible choice):

  • “humann_lca” uses a new set of LCAs we defined to handle cases like my example above (basically building in a certain “error tolerance” for sequences that disagree with the majority LCA). I would recommend using these.

  • “uniref_lca” uses the LCAs assigned by UniRef.

  • “source_tax” uses the taxonomy of the species from which the UniRef representative sequence hails instead of an LCA taxon for the cluster.
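The difference between a strict LCA and an "error tolerant" one can be sketched in a few lines of Python. To be clear, this is not how the humann_lca values were actually computed; the 5% `tolerance`, the function names, and the toy lineages (including the placeholder `hypothetical_phage`) are assumptions made purely to reproduce the E. coli example above.

```python
from collections import Counter


def strict_lca(lineages):
    """Longest prefix shared by every lineage; an empty list means 'root'."""
    lca = []
    for ranks in zip(*lineages):
        if len(set(ranks)) != 1:
            break
        lca.append(ranks[0])
    return lca


def tolerant_lca(lineages, tolerance=0.05):
    """Deepest prefix shared by at least (1 - tolerance) of the lineages.

    Assumes tolerance < 0.5 so that the majority prefix is unique.
    """
    needed = (1 - tolerance) * len(lineages)
    lca = []
    depth = 0
    while True:
        # Count each lineage prefix that reaches one rank deeper.
        prefixes = Counter(tuple(l[: depth + 1]) for l in lineages if len(l) > depth)
        if not prefixes:
            break
        prefix, count = prefixes.most_common(1)[0]
        if count < needed:
            break
        lca = list(prefix)
        depth += 1
    return lca


# Toy cluster: 100 E. coli proteins and one viral protein (made-up lineages).
ecoli = ["Bacteria", "Enterobacteriaceae", "Escherichia", "Escherichia coli"]
virus = ["Viruses", "hypothetical_phage"]
cluster = [ecoli] * 100 + [virus]

print(strict_lca(cluster))    # empty list, i.e. root of life
print(tolerant_lca(cluster))  # the full E. coli lineage
```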

To your last question, we don’t have a mode that would allow you to write stratifications of mixed taxonomic resolution to the output (e.g. known species stay as species while unclassified bubbles up to family). This sort of approach can get messy if you aren’t super careful. Those risks aside, I suppose you could achieve something like this by running infer_taxonomy in “unclassified” mode and then weaving the refined unclassified levels back together with the original species-stratified values.
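A rough Python sketch of that weaving step might look like the following. Treat it as a starting point under stated assumptions rather than a tested recipe: it assumes both tables have already been parsed into `(feature_id, abundance)` pairs, that the “unclassified”-mode output labels its refined strata as `FEATURE|f__Family` with any leftover as `FEATURE|unclassified` (I have not verified the exact output format), and the helper name `weave` and all values are made up.

```python
def weave(original_rows, refined_rows):
    """Keep totals and species strata from the original table and swap in
    the refined strata from an 'unclassified'-mode run (illustrative only)."""
    # Drop the original unclassified stratum; it is replaced below.
    kept = [(fid, val) for fid, val in original_rows
            if not fid.endswith("|unclassified")]
    # Take only stratified rows from the refined table (skip any totals).
    refined = [(fid, val) for fid, val in refined_rows if "|" in fid]
    # Stable sort by feature name keeps each feature's rows together,
    # with the original rows ahead of the refined ones.
    return sorted(kept + refined, key=lambda row: row[0].split("|")[0])


# Hypothetical rows from the original genefamilies table.
original = [
    ("UniRef90_A", 10.0),
    ("UniRef90_A|g__Escherichia.s__Escherichia_coli", 6.0),
    ("UniRef90_A|unclassified", 4.0),
]
# Hypothetical rows from an 'unclassified'-mode run at family level.
refined = [
    ("UniRef90_A|f__Enterobacteriaceae", 3.0),
    ("UniRef90_A|unclassified", 1.0),
]

woven = weave(original, refined)
for fid, val in woven:
    print(fid, val, sep="\t")
```

One sanity check worth doing on real data is that the strata of each feature still sum to its total after weaving; if they do not, the two tables were probably not derived from the same input.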