Dear HUMAnN developers,
I am currently using HUMAnN v4.0.0.alpha together with MetaPhlAn vOct22 profiles and had a few technical questions regarding the internal filtering and coverage logic.
According to the documentation, HUMAnN appears to apply a minimum estimated coverage threshold of 0.5x during the prescreening step before organisms are retained for downstream analysis. However, while inspecting the log file from one sample run (attached), I also noticed filtering-related values associated with very small mapped read fractions.
For example, I observed values around ~0.02 or ~0.03 being retained, which made me wonder whether organisms around ~0.01 or lower are automatically excluded internally. I may be misunderstanding the interpretation of these values, so I would appreciate clarification.
My questions are:
-
Does HUMAnN internally exclude taxa below a certain MetaPhlAn relative abundance threshold (for example around 0.01%) before pangenome construction or translated search?
-
Are the “coverage” values used by HUMAnN directly inherited from MetaPhlAn SGB abundance/marker coverage estimates, or are they recalculated internally during gene-family/pathway mapping against the nucleotide and protein databases?
-
Is the reported “0.5x coverage” threshold based on:
-
MetaPhlAn marker genes coverage / estimated pangenome coverage,
-
pathway database coverage,
-
or another internal metric?
-
How reasonable is it to apply the same default 0.5x coverage threshold across shallow and deep shotgun sequencing datasets?
For example, in shallow sequencing datasets, could biologically relevant low-abundance organisms be disproportionately filtered compared to deeper datasets?
-
More generally, would it be technically reasonable to interpret HUMAnN primarily as a tool optimised for global functional profiling (gene families, pathways, reactions), while fine-scale quantitative interpretation of low-abundance SGB tracking should be approached more cautiously?
I am attaching the log file from one sample run for reference.
HUG_229_T1.humann.log.txt (257.8 KB)
Thank you very much for your help and for developing HUMAnN.
Kind regards,
Jaspreet
Filtering is being done on the estimated coverage values from MetaPhlAn 4. When run using the mode that HUMAnN 4 invokes, MetaPhlAn adds an extra column to its output reporting coverage in units of reads/nucleotide site. HUMAnN internally multiplies this by average read length to get to a more typical nucleotides/nucleotide site measure (requiring it to be >0.5x).
If you look at one of the smallest accepted abundances from your log file (0.05% of reads), we can quickly estimate the corresponding coverage as total reads * read length * relative abundance / genome length, which in this case is 25e6 * 150 * 0.0005 / 5e6 ~ 0.4x coverage. That’s close to the 0.5x threshold and suggests that my genome size estimate (5e6) is probably a bit high for this species. Alternatively, if you look at the coverage estimate for this species in your MetaPhlAn profile and multiply by average read length, my guess is that it’s right at the 0.5x border.
I think that answers your Qs 1-2. For Q3, MetaPhlAn gets its coverage estimate by directly measuring the coverage of a species’ marker genes and assuming that applies to the full genome (using the average genome length within the species to also estimate total reads explained, although that’s not used here).
To Q4, filtering on coverage is (IMO) the more sensible thing to do across deep vs. shallow datasets, hence the change to this approach in HUMAnN 4. Functional profiling requires reasonable coverage of a species’ genome, otherwise you aren’t really seeing the complete set of genes that it’s contributing. By filtering on % abundance (our previous approach), you’d overestimate your ability to profile moderately abundant species in shallow samples.
To Q5, HUMAnN is designed for functional profiling of reasonably well-covered species in metagenomes. As you can see in your example here, for a typical metagenome (10Ms of reads), even a fairly low-abundance species can have reasonable expected coverage. Conversely, I would say that more generally, functional profiling of very shallow samples (by any method) is tricky given that you only expect reads to cover the genomes of the most abundant members of a community in that regime.
1 Like