Using humann2_infer_taxonomy and/or uniref90-tol-lca.dat to assign taxonomy to UniRef90 hits

Hi HUMAnN2 team,
I’m working on assigning taxonomy to UniRef90 hits and would like to use the HUMAnN2 tooling for this. I have output from a diamond blastx run against UniRef90, so my subject IDs are UniRef90 IDs, sometimes followed by a pipe character and a string, e.g., UniRef90_R5XQM2 and UniRef90_R5XQM2|1416. I tried joining on the bare ID (UniRef90_R5XQM2) against your uniref90-tol-lca.dat file, but this gives fairly limited taxonomic information where I would have expected something deeper. Is there a way to run humann2_infer_taxonomy on diamond blastx vs UniRef90 output?

The reason I’m not using the HUMAnN2 pipeline is that I’m working in an environment with a ton of novelty, so had to build my own gene catalogue and am checking annotations across several databases (UniRef, PFAM, KEGG etc.). I understand this is outside the scope of using the HUMAnN2 pipeline but was hoping you can offer some guidance nonetheless. Thank you!

Can you expand on the deeper taxonomic info you’re looking for? The tol-lca file gives the LCA for the UniRef family as defined in UniRef (i.e. the LCA of all species contributing a protein to the family).

Ah, ok, I misunderstood the file and that makes sense. I tried going through the humann2_infer_taxonomy script and it seemed like something more complicated was going on. So joining the diamond blastx vs UniRef90 output to the LCA file on UniRef90 ID is sufficient for LCA taxonomic assignment?

Yes, that should allow you to add the NCBI TaxID for the LCA to your blast output table.
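
For anyone landing here later, the join described above could look roughly like the sketch below. It is only a sketch under assumptions, not how HUMAnN2 does it internally. It assumes the LCA file is tab-separated with the UniRef90 ID in the first column and the LCA in the second (check your copy of uniref90-tol-lca.dat), and that the blast table is tab-separated with the subject ID in the second column, as in DIAMOND's default tabular output. The function names, the toy rows, and the TAXID_PLACEHOLDER value are made up for illustration.

```python
import csv
import io


def family_id(subject_id):
    """Drop anything after the first pipe: 'UniRef90_R5XQM2|1416' -> 'UniRef90_R5XQM2'."""
    return subject_id.split("|", 1)[0]


def load_lca(handle):
    """Read the LCA table into a dict keyed by UniRef90 ID.

    ASSUMPTION: tab-separated, ID in column 1, LCA in column 2.
    """
    lca = {}
    for row in csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE):
        if len(row) >= 2 and not row[0].startswith("#"):
            lca[family_id(row[0])] = row[1]
    return lca


def annotate_hits(handle, lca, subject_col=1, missing="NA"):
    """Yield each hit row with the LCA appended as a new last column.

    ASSUMPTION: the subject ID sits in column 2 (index 1) of the hit table.
    """
    for row in csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE):
        if len(row) <= subject_col:
            continue  # skip blank or truncated lines
        yield row + [lca.get(family_id(row[subject_col]), missing)]


# Tiny in-memory stand-ins for the real files; the values are placeholders.
lca_text = "UniRef90_R5XQM2\tTAXID_PLACEHOLDER\n"
hits_text = (
    "gene_1\tUniRef90_R5XQM2|1416\t85.0\n"
    "gene_2\tUniRef90_R5XQM2\t90.1\n"
    "gene_3\tUniRef90_UNKNOWN\t40.0\n"
)

lca = load_lca(io.StringIO(lca_text))
annotated = list(annotate_hits(io.StringIO(hits_text), lca))
for row in annotated:
    print("\t".join(row))
```

With real data you would pass open file handles instead of the io.StringIO objects. Hits whose UniRef90 ID is missing from the LCA file get "NA" rather than being dropped, so you can see how many went unassigned.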