Deciding how the genefamilies table is produced

Billy_Law · January 15, 2021, 5:11pm

Hi!
I was looking at the bowtie_aligned.tsv and diamond_aligned.tsv, against the genefamilies output table and I am a little confused on construction of the final table.

Maybe I’m making a few incorrect assumptions:

bowtie_aligned.tsv is a formatted table of the samfile export of bowtie. It shows all of the hits, and the reads.
diamond_aligned.tsv is a formatted table of the diamond output. it too shows all of the hits, and the reads. with the appropriate quality scores
HUMAnN2/3 uses the percent id (90 and above) to filter the hits reported in both bowtie and diamond.
HUMAnN2/3 will only consider (3) as the filtering criteria. It will allow reads to be mapped more than once.

I tried to mimic this process by using the assumptions above, but I am not getting the same number of gene families.
My Process is:
Take the 90+ percent identity hits from bowtie, and combine it with the 90+ percent identity hits from diamond.
(I’ve also noticed that there are some overlaps between the 2, in terms of reads annotated)

so my question is: How is the decision on what to export to the gene families output table made?

franzosa · January 25, 2021, 12:25am

One of the key steps that’s missing for your numbered list is database sequence coverage filtering: by default, HUMAnN will only report a gene family if >50% of its sites were hit by reads. This was a feature of translated search in HUMAnN 2 and applies to all search in HUMAnN 3. Thus, it’s possible for a read to have a confident hit to a gene in the raw output but not be included in the genefamilies file. This behavior can be tuned from the HUMAnN CLI if you’re OK losing some specificity (in exchange for boosting sensitivity to rare genes).

Billy_Law · January 25, 2021, 12:44am

Ahh, ok

So my method (as it stands) would give too many hits.
This database sequence coverage filtering criteria further limits the reporting.

My interim solution was letting HUMAnN do this for me, by only using reads that annotate to one of the genefamilies reported in the file.

Thanks!

Topic		Replies	Views
Humann3 output logs: total bugs HUMAnN	2	2206	January 4, 2021
HUMAnN3 minor updates HUMAnN	1	373	June 29, 2020
Guidance on UniRef database contents/filtering HUMAnN	3	1265	June 26, 2020
High unaligned percentage HUMAnN	6	414	August 5, 2021
No unclassified fraction when using humann2 with uniref100 HUMAnN	1	434	August 24, 2020

Deciding how the genefamilies table is produced

Related topics