I was comparing bowtie_aligned.tsv and diamond_aligned.tsv against the genefamilies output table, and I am a little confused about how the final table is constructed.
Maybe I’m making a few incorrect assumptions:
- bowtie_aligned.tsv is a formatted table of the SAM file exported by bowtie. It shows all of the hits and the reads.
- diamond_aligned.tsv is a formatted table of the diamond output. It too shows all of the hits and the reads, with the appropriate quality scores.
- HUMAnN2/3 uses the percent identity (90 and above) to filter the hits reported in both bowtie and diamond.
- HUMAnN2/3 considers only that percent identity threshold as the filtering criterion. It allows a read to be mapped more than once.
I tried to mimic this process by using the assumptions above, but I am not getting the same number of gene families.
My process is:
Take the 90+ percent identity hits from bowtie and combine them with the 90+ percent identity hits from diamond.
(I’ve also noticed that there is some overlap between the two, in terms of which reads are annotated.)
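To make the process concrete, here is a rough sketch of what I am doing. It only illustrates my own steps and says nothing about how HUMAnN actually builds the table. The column positions (read ID, target, percent identity in columns 0, 1, 2) are an assumption about the file layout, the function names `load_hits` and `combine` are made up for this example, and the sample rows are invented.

```python
import csv
import io


def load_hits(handle, min_identity=90.0, read_col=0, target_col=1, identity_col=2):
    """Return the set of (read, target) pairs at or above min_identity.

    Column indices are assumptions; adjust them to the real file layout.
    """
    hits = set()
    for row in csv.reader(handle, delimiter="\t"):
        # Skip blank lines and comment/header lines.
        if not row or row[0].startswith("#"):
            continue
        if float(row[identity_col]) >= min_identity:
            hits.add((row[read_col], row[target_col]))
    return hits


def combine(bowtie_hits, diamond_hits):
    """Pool both hit sets and report which reads appear in both."""
    bowtie_reads = {read for read, _ in bowtie_hits}
    diamond_reads = {read for read, _ in diamond_hits}
    return {
        # Every target with at least one passing hit in either file.
        "families": {target for _, target in bowtie_hits | diamond_hits},
        # Reads annotated by both aligners (the overlap I noticed).
        "shared_reads": bowtie_reads & diamond_reads,
    }


# Invented rows: read ID, target, percent identity.
bowtie_tsv = "read1\tgeneA\t98.0\nread2\tgeneB\t85.0\n"
diamond_tsv = "read1\tgeneC\t92.5\nread3\tgeneA\t91.0\n"

result = combine(
    load_hits(io.StringIO(bowtie_tsv)),
    load_hits(io.StringIO(diamond_tsv)),
)
print(sorted(result["families"]), sorted(result["shared_reads"]))
```

With the real files, `load_hits` would be called on the opened bowtie_aligned.tsv and diamond_aligned.tsv instead of the in-memory strings. The number of entries in `families` is what I am comparing against the genefamilies table.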
So my question is: how is the decision made about what to export to the gene families output table?