I have come across the following discussions:
- Rarefaction of sequences before MetaPhlan analysis
- (unanswered) Should I calculate beta diversity with or without rarefaction?
- Shotgun rarefactions - metagenomics (microbiome), MetaPhlAn2
- Rarefy metagenome sequence data before MetaPhlAn3 analysis
and others, and the general consensus seems to be “yes, you must rarefy if you have varying sampling depths”. In our own data we have also found batch effects associated with sampling depth.
I’m trying to minimize these batch effects, and therefore have to rerun each sample rarefied to a depth appropriate for the cohort being analyzed.
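For context, the rarefaction step we rerun is essentially read subsampling before MetaPhlAn. A minimal sketch (reservoir sampling over FASTQ records; the function name, file paths, and target depth are placeholders, not part of any existing tool):

```python
import gzip
import random

def rarefy_fastq(in_path, out_path, depth, seed=42):
    """Reservoir-sample `depth` reads from a FASTQ file (4 lines per record)."""
    rng = random.Random(seed)
    reservoir = []
    opener = gzip.open if in_path.endswith(".gz") else open
    with opener(in_path, "rt") as fh:
        record = []
        n = 0  # number of complete records seen so far
        for line in fh:
            record.append(line)
            if len(record) == 4:
                if len(reservoir) < depth:
                    reservoir.append(record)
                else:
                    # Replace an existing record with probability depth / (n + 1)
                    j = rng.randrange(n + 1)
                    if j < depth:
                        reservoir[j] = record
                n += 1
                record = []
    with open(out_path, "w") as out:
        for rec in reservoir:
            out.writelines(rec)
    return len(reservoir)
```

In practice we use an off-the-shelf subsampler for this; the sketch is just to pin down what “rarefied to a depth” means for each sample.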
Would it be theoretically possible to build a post-hoc normalization utility that takes metaphlan/humann3 results tables and a sequencing-depth parameter? Simply scaling the counts by the number of reads is ineffective, because many reads contribute to a given pathway or taxon assignment. Essentially, if metaphlan gives us a probable count of a particular taxon given a set of reads, what we really need is the probable count of that taxon given the set of reads and the total number of reads in the sample.
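To make the idea concrete, here is a minimal sketch of the naive version of such a utility: subsampling an integer counts vector to a common depth without replacement (the same operation as `rrarefy` in the R vegan package). As argued above, this does not capture the many-reads-per-assignment structure of MetaPhlAn/HUMAnN output, so a real utility would need a proper probabilistic model; the function name here is hypothetical.

```python
import numpy as np

def rarefy_counts(counts, depth, seed=0):
    """Subsample an integer count vector to `depth` total reads without replacement."""
    counts = np.asarray(counts, dtype=int)
    total = counts.sum()
    if total < depth:
        raise ValueError(f"sample has only {total} reads, cannot rarefy to {depth}")
    rng = np.random.default_rng(seed)
    # Expand counts to individual read labels, draw `depth` of them, re-tally.
    reads = np.repeat(np.arange(counts.size), counts)
    kept = rng.choice(reads, size=depth, replace=False)
    return np.bincount(kept, minlength=counts.size)
```

The depth-aware utility I am imagining would do something like this, but on the read-to-assignment evidence rather than on the final counts.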
It would be great to have some clear guidance in the manual about how to address this (very common) situation; otherwise any subsequent differential abundance analysis will just be detecting rare taxa that only become observable with deeper sequencing. Advice about prevalence/abundance thresholds would be useful as well; again, because many reads contribute to a taxonomy assignment, it is not trivial to determine whether a bug is below the limit of detection. (Compare amplicon sequencing, where we can reasonably say: "Even though we detect 2 copies of bug_x in sample A at a depth of 1000 reads, and 0 copies of bug_x in sample B at a depth of 100 reads, we cannot assume these are differentially abundant, because we did not sequence enough molecules in sample B to detect the relative abundance of 1/500 seen in sample A.")
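To quantify that amplicon example under a simple binomial sampling model (numbers taken from the example above, nothing tool-specific):

```python
def detection_prob(p, n):
    """Probability of observing >= 1 read of a taxon with true relative
    abundance p when sequencing n reads, assuming binomial sampling."""
    return 1 - (1 - p) ** n

p = 2 / 1000  # relative abundance implied by 2 reads out of sample A's 1000

print(detection_prob(p, 1000))  # sample A: ~0.86 chance of seeing it at all
print(detection_prob(p, 100))   # sample B: ~0.18 chance, so a zero is uninformative
print(100 * p)                  # expected count in sample B: 0.2 reads
```

So sample B's zero count is entirely consistent with sample A's abundance; a threshold rule would need this kind of depth-dependent reasoning, which is exactly what is hard to reconstruct from MetaPhlAn's output alone.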
Thanks in advance!