How do I deal with possible PCR or optical duplicate reads in my metagenomic sample?
I am concerned that removing them will discard too many reads, affecting abundance estimates. On the other
hand, if duplicates are left in, that also affects abundance estimates.
How should I deal with this problem?
Hello. I am also wondering whether duplicated reads reflect library prep artifacts or a well-covered microbiota. Does anyone have a clear idea of how to deal with them, both for read-level profiling and for assembly-based approaches?
We drop duplicates as part of our QC before running any of the community profiling tools. That’s the standard in human WGS genomics (see this, for instance), and with modern Illumina sequencers having worst-case duplication rates of ~35% I don’t think you would get accurate abundances by retaining them. We use clumpify, which can be configured to detect and remove both PCR and optical duplicates (a minimal example invocation is sketched below). Dropping them also improves assembly if you are doing that downstream. From what I see, the arguments against dropping duplicates are (1) the time/memory it takes and (2) the assertion that duplication should be evenly distributed across taxa and across genomic regions within taxa. I haven’t seen evidence for the latter, but there could be papers I’ve missed!
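For reference, here is a minimal sketch of how clumpify.sh (from BBTools) might be run for this, assuming paired-end gzipped FASTQ input; the file names, `subs` value, and memory setting are placeholders to adapt to your own data rather than a prescribed pipeline:

```bash
# Deduplicate reads with clumpify.sh (BBTools) before profiling or assembly.
# File names and parameter values here are illustrative placeholders.
#   dedupe=t   remove duplicate reads (covers both PCR and optical duplicates)
#   subs=2     mismatches tolerated when deciding two reads are duplicates
# To remove only optical duplicates, add optical=t and set dupedist to a value
# appropriate for your flow cell (see the BBTools documentation).
clumpify.sh \
    in=sample_R1.fastq.gz in2=sample_R2.fastq.gz \
    out=sample_dedup_R1.fastq.gz out2=sample_dedup_R2.fastq.gz \
    dedupe=t subs=2 \
    -Xmx16g
```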
Thanks! I’m curious to hear any additional feedback. It seems that manuscripts based on bioBakery-friendly pipelines (i.e., using tools like Kneaddata) and MAG binning approaches such as the TARA manual binning effort typically do not account for duplicate removal, which leaves me quite uncertain.