Need for human sequences in ChocoPhlAn database


As per this recent pre-print:

Is there a need to include human genome sequences (or other reference genomes) in the ChocoPhlAn database? I understand this approach works very differently from Kraken, but I wonder whether, even after attempting to filter out the human genome, residual human DNA may be affecting the taxonomic calls in HUMAnN 3.

A few thoughts here:

  1. We recommend depleting host reads upstream of analysis with MetaPhlAn and (more importantly) HUMAnN.

  2. The more complete the host genome used in decontamination, the more effective the decontamination will be.

  3. The (pan)genome databases used by MetaPhlAn and HUMAnN have been QC’ed to enrich for high-quality microbial genomes and MAGs, so we expect the issue of host contamination in the genomes to be reduced relative to genome collections that have not been QC’ed in this manner.

  4. MetaPhlAn’s approach to taxonomic profiling (which HUMAnN uses to pick pangenomes to map reads against) does not involve trying to bin individual reads by taxonomy, which results in increased specificity (fewer false positives).

  5. The step in our pipeline that is probably most vulnerable to host contamination is HUMAnN’s “fallback” to translated search, as this step tries to quantify functions among the unclassified reads in a community independent of taxonomy. Cryptic host reads derived from protein-coding regions could theoretically map at this stage, though we expect this to be rare (at least for the human genome, protein-coding material is both very well characterized and only a small fraction of total host contamination).

  6. Lastly, if you are working with highly host-contaminated samples, I recommend including the “fraction of host reads removed” as a variable during modeling. Any feature that correlates strongly with this variable has a strong chance of representing cryptic contamination itself and should thus be interpreted with caution.
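To make point 6 concrete, here is a minimal sketch of how one might screen features against the host fraction before modeling. It is only an illustration under stated assumptions: the function names (`host_fraction`, `spearman`, `flag_host_associated`), the use of Spearman correlation, and the 0.8 cutoff are all hypothetical choices of mine, not part of HUMAnN or MetaPhlAn. The read counts are assumed to come from whatever host-depletion step you ran upstream.

```python
from math import sqrt


def host_fraction(raw_reads, reads_after_depletion):
    """Fraction of reads removed as host: (raw - remaining) / raw."""
    if raw_reads <= 0:
        raise ValueError("raw_reads must be positive")
    return (raw_reads - reads_after_depletion) / raw_reads


def _ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman rank correlation (Pearson correlation of the ranks)."""
    rx, ry = _ranks(x), _ranks(y)
    n = len(rx)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant vector carries no rank information
    return cov / sqrt(var_x * var_y)


def flag_host_associated(features, host_fractions, threshold=0.8):
    """Return {feature: rho} for features with |rho| >= threshold.

    `features` maps a feature name to per-sample abundances, in the same
    sample order as `host_fractions`. The threshold is an arbitrary
    illustrative value, not a recommended cutoff.
    """
    flagged = {}
    for name, abundances in features.items():
        rho = spearman(abundances, host_fractions)
        if abs(rho) >= threshold:
            flagged[name] = rho
    return flagged
```

A flagged feature is not necessarily contamination (host load can genuinely covary with biology), so treat this as a prompt for closer inspection rather than an automatic filter. In a full analysis you would typically also pass the host fraction to your statistical model as a covariate alongside the variables of interest.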

Thanks for your reply!

Thank you for your thoughtful comments and considerations. I agree, the upfront depletion of human reads seems imperative. It is helpful to see how HUMAnN 3 differs in its approach and how it is already likely protected from the level of error in the original paper. I really like your idea of using the human/host reads as a covariate; that would definitely have eliminated a lot of the problems they encountered.