Deciding how the genefamilies table is produced

I was looking at the bowtie_aligned.tsv and diamond_aligned.tsv, against the genefamilies output table and I am a little confused on construction of the final table.

Maybe I’m making a few incorrect assumptions:

  1. bowtie_aligned.tsv is a formatted table of the samfile export of bowtie. It shows all of the hits, and the reads.
  2. diamond_aligned.tsv is a formatted table of the diamond output. it too shows all of the hits, and the reads. with the appropriate quality scores
  3. HUMAnN2/3 uses the percent id (90 and above) to filter the hits reported in both bowtie and diamond.
  4. HUMAnN2/3 will only consider (3) as the filtering criteria. It will allow reads to be mapped more than once.

I tried to mimic this process by using the assumptions above, but I am not getting the same number of gene families.
My Process is:
Take the 90+ percent identity hits from bowtie, and combine it with the 90+ percent identity hits from diamond.
(I’ve also noticed that there are some overlaps between the 2, in terms of reads annotated)

so my question is: How is the decision on what to export to the gene families output table made?

One of the key steps that’s missing for your numbered list is database sequence coverage filtering: by default, HUMAnN will only report a gene family if >50% of its sites were hit by reads. This was a feature of translated search in HUMAnN 2 and applies to all search in HUMAnN 3. Thus, it’s possible for a read to have a confident hit to a gene in the raw output but not be included in the genefamilies file. This behavior can be tuned from the HUMAnN CLI if you’re OK losing some specificity (in exchange for boosting sensitivity to rare genes).

Ahh, ok

So my method (as it stands) would give too many hits.
This database sequence coverage filtering criteria further limits the reporting.

My interim solution was letting HUMAnN do this for me, by only using reads that annotate to one of the genefamilies reported in the file.