Dear bioBakery help forum,
Thank you for the wonderful tools.
I am working on analysis on Humann3 output on metacyc pathway coverage. I encountered some issue on the coverage. This is the coverage of pathway: HISTSYN-PWY (L-histidine biosynthesis). Please see the table below. The table shows that coverage of this pathway and the reads of corresponding EC gene families. I am confused on the difference between different samples. For example, the sample 3A1493 has the fewer reads of each gene families than those of sample2A1086. However, the coverage of 3A1493 is 1 and the coverage of 2A1086 is 0.95.
Could you please give some instructions on how to calculate the coverage?
You can read more about pathway coverage here (or in the HUMAnN 1 paper):
The short answer is that coverage is based on where reactions fall within the abundance distribution of their sample, and hence a pathway X could be covered more highly in a sample with lower read depth if the X reactions were proportionally more abundant there.
That said, the coverage calculations are mostly a holdover from HUMAnN 1 (where they were more critical) and are a target for update/replacement in the future. I would focus on the pathway abundances instead; if a pathway has abundance > 0 you can consider that it was fairly confidently covered/detected by modern HUMAnN.
Thank you so much for your reply.
Thank you so much. This is very helpful. We are trying to set thresholds in our analysis - would you say the same for gene family (abundance > 0 → confidently covered/detected)? Thank you!
Yes, but for different reasons and with some caveats. A pathway being non-zero means that many reactions had to be non-zero (an abundance of evidence). The equivalent we use at the single-gene level is requiring >50% of sites to have been hit by a read (i.e. a single read hitting a gene is not sufficient to call the gene as present). This increases specificity relative to the “any hit = detected” approach, but it’s not perfect. For example, there are some gene families that are very hard to tell apart on the basis of short-read recruitment (e.g. an A-domain family + B-domain family vs. an AB-domain family).
Thank you so much! This is very helpful.