I’m trying to use HUMAnN 3 with a custom version of UniRef90 (nothing special, just a different release), but I keep getting empty results (0% unmapped and nothing else).
My workflow is a little unusual in that I am aligning reads to the database first, then handing off the m8 file to humann2 to quantify pathways. Based on digging around through the docs/code, it seems like I need an id-mapping table to tell humann2 how big the genes are, what species they belong to, etc.
I have two questions:
- Is my understanding correct? I’ve tried this out with a small fake id-mapping table and it seems to work.
- Does such an id-mapping table already exist for UniRef90? I’m not particularly concerned about the release version mismatch for this.
An id mapping file allows you to use HUMAnN with a truly custom database (with arbitrary sequence headers). Those headers would show up as targets in your m8 file, and the id mapping file would then allow you to associate the read mass they recruit with species + functions.
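As a rough illustration of what such a file could look like, here is a minimal Python sketch that writes a tab-delimited id mapping table. The column order used here (target id, gene family, gene length in nucleotides, taxon) is my assumption and should be checked against the HUMAnN manual before use; `write_id_mapping`, the `seq0001` target id, and the taxon string are hypothetical names made up for the example.

```python
import csv
import io


def write_id_mapping(rows, handle):
    """Write one tab-delimited line per database target.

    Assumed column order (verify against the HUMAnN manual):
    target id as it appears in the m8 file, gene family,
    gene length in nucleotides, taxon.
    """
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    for target_id, gene_family, length_nt, taxon in rows:
        writer.writerow([target_id, gene_family, length_nt, taxon])


# Hypothetical single-row example; the ids and taxon are placeholders.
buffer = io.StringIO()
write_id_mapping(
    [("seq0001", "UniRef90_ABC", 300, "g__Hypothetical.s__Hypothetical_species")],
    buffer,
)
mapping_text = buffer.getvalue()
```

In practice the handle would be an open file rather than an in-memory buffer, and the rows would come from whatever annotation source describes the custom database.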
The alternative is to build a database whose sequence headers contain all that information, which in the case of HUMAnN’s built-in UniRef90 is just the UniRef90 ID + the DNA-equivalent sequence length (e.g. >UniRef90_ABC|300 for a protein ABC with a length of 100 amino acids, since 100 aa × 3 = 300 nt).
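The header convention described above can be sketched in a few lines of Python. This only covers the ID|length pattern from the example; the helper names `uniref_header` and `parse_header` are hypothetical and not part of HUMAnN.

```python
from typing import Tuple


def uniref_header(uniref_id: str, protein_length_aa: int) -> str:
    """Build a FASTA header of the form >ID|nucleotide_length."""
    # DNA-equivalent length: three nucleotides per amino acid.
    nucleotide_length = 3 * protein_length_aa
    return f">{uniref_id}|{nucleotide_length}"


def parse_header(header: str) -> Tuple[str, int]:
    """Split a >ID|length header back into its ID and length."""
    name, length = header.lstrip(">").rsplit("|", 1)
    return name, int(length)


header = uniref_header("UniRef90_ABC", 100)
parsed = parse_header(header)
```

Splitting on the last `|` keeps the sketch tolerant of IDs that themselves contain a pipe character, which is a design choice for this example rather than something the thread specifies.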
Makes sense. HUMAnN does not associate taxonomic info with UniRef entries by default, right?
“the id mapping file would then allow you to associate the read mass they recruit with species + functions.”
To clarify this: are the gene lengths used in RPK calculations pulled from the “length” attribute of the sequence header/id mapping file, or are they determined during the alignment steps, i.e. from the Bowtie2 alignment results (“Column 9: Observed template length”)?
The former (from the header or id mapping file). Though in the case of SAM output the two ought to agree.
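To make the role of that length concrete, here is a simplified Python sketch of an RPK (reads per kilobase) calculation using the length taken from the header or id mapping file. It assumes plain read counts and ignores any alignment weighting or filtering HUMAnN applies internally; the function name `rpk` is hypothetical.

```python
def rpk(read_count: float, gene_length_nt: int) -> float:
    """Reads per kilobase: reads divided by gene length in kb."""
    if gene_length_nt <= 0:
        raise ValueError("gene length must be a positive number of nucleotides")
    # Multiply before dividing to express the length in kilobases.
    return read_count * 1000.0 / gene_length_nt


# 30 reads recruited by a 300 nt gene (the UniRef90_ABC|300 example).
example_rpk = rpk(30, 300)
```

The point of the sketch is only that the denominator comes from the annotated length, so a wrong or missing length in the header or mapping file changes the abundance directly.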