Hi! I have two metagenomics samples with 600-700 million reads each (2x150) that I’ve been trying to run through the HUMAnN pipeline. Unfortunately, they aren’t able to complete even using the new HUMAnN 4 option, because they are so large - I’m limited to 7 days and they keep timing out.
However, if I split the data into 10 files I can run each of them independently and with HUMAnN 4 they can complete analysis (they were still timing out with HUMAnN 3, I think because they are environmental and still had about 97% of the reads remaining after nucleotide alignment).
My question is, is there a way to merge the outputs for those ten sub-samples back into what would have been reported for a single sample? I was trying to understand the exact normalization methods used by HUMAnN in hopes of reversing them, but I don’t want to mess things up if I’m misunderstanding them… I also think different features would need to be merged in different ways (abundance vs. adjusted CPMs, for example).
For the CPMs in the gene families file I was thinking I could take the average across the 10 sub-samples, since they’re already normalized by read depth and gene length, while for abundance (as in the path abundance file) I could just sum the values for each pathway. I’m at a loss for the relative abundance and coverage columns from the MetaPhlAn profile, though, and I’m not sure what metric is being reported in the reactions file.
Has anyone done this before, or have suggestions for how this could be done accurately?
Thank you so much!
Those are some deeply-sequenced samples!
For HUMAnN 4, genes, reactions, and pathways are all reported in (already-normalized) CPM units. So if you’ve analyzed your reads divided up into 10 subsamples, the right way to combine the separate CPMs is by averaging the subsample profiles. This treats each subsample as an independent estimator of the truth, and the averaging will improve the signal-to-noise ratio (similar to starting with a larger N, which is what you’d be doing if you could analyze all reads at once). The one sticky point in this is that very rare (low-coverage) features may drop below our detection thresholds in some subsamples but not others, driving their abundances to 0 in those cases. Upon averaging, this will cause underestimation for features whose true coverage is in the range of 0.5x.
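If it helps, here’s a minimal sketch of that averaging step with pandas. It assumes you’ve already parsed each subsample’s HUMAnN profile into a Series indexed by feature ID (the function name and toy feature values are made up for illustration):

```python
import pandas as pd

def merge_subsample_profiles(profiles):
    """Average per-subsample CPM profiles into a single-sample estimate.

    `profiles` is a list of pandas Series indexed by feature ID (e.g.
    UniRef90 gene families, reactions, or pathways), one per subsample.
    Features missing from a subsample are assumed to have fallen below
    the detection threshold there and are treated as 0.
    """
    return pd.concat(profiles, axis=1).fillna(0).mean(axis=1)

# Toy example: two subsamples; UniRef90_B drops below detection in the second
sub1 = pd.Series({"UniRef90_A": 120.0, "UniRef90_B": 4.0}, name="sub1")
sub2 = pd.Series({"UniRef90_A": 118.0}, name="sub2")

merged = merge_subsample_profiles([sub1, sub2])
print(merged)
# UniRef90_A averages to 119.0; UniRef90_B is underestimated at 2.0
```

The `fillna(0)` is exactly where the underestimation of borderline features sneaks in: a feature detected in only some subsamples contributes zeros from the others.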
Likewise, for MetaPhlAn, you would average the relative abundances when analyzing by subsampling (following the same logic above). The one place where you’d want to sum rather than average would be the coverage measure, since you’d want to estimate the total coverage of each taxon in the original sample, not the average coverage across the subsamples.
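To make the average-vs-sum distinction concrete, here’s a small sketch for the MetaPhlAn side, again assuming you’ve parsed each subsample’s profile into a DataFrame indexed by clade (the column names and numbers are hypothetical):

```python
import pandas as pd

# Per-subsample MetaPhlAn-style tables: relative abundance and coverage by clade
sub1 = pd.DataFrame({"rel_abund": [60.0, 40.0], "coverage": [12.0, 8.0]},
                    index=["g__Alpha", "g__Beta"])
sub2 = pd.DataFrame({"rel_abund": [55.0, 45.0], "coverage": [11.0, 9.0]},
                    index=["g__Alpha", "g__Beta"])
subs = [sub1, sub2]

# Relative abundances: each subsample estimates the same composition -> average
rel = pd.concat([d["rel_abund"] for d in subs], axis=1).fillna(0).mean(axis=1)

# Coverage: we want the total depth on each clade in the original sample -> sum
cov = pd.concat([d["coverage"] for d in subs], axis=1).fillna(0).sum(axis=1)

merged = pd.DataFrame({"rel_abund": rel, "coverage": cov})
print(merged)
# g__Alpha: rel_abund 57.5, coverage 23.0; g__Beta: rel_abund 42.5, coverage 17.0
```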
Yeah - the core facility didn’t have enough samples to fill the flow cell so we ended up with far more reads than we had requested. It’s not a bad problem to have though!
Thank you for this explanation - it is super helpful and gives me a good way to move forward. I was concerned about the threshold as well but I will just make a note of it so we can avoid making possibly skewed conclusions about low coverage features.
I should also clarify that the kinds of features subject to the error I described would be only borderline-detectable in a conventional metagenome or metatranscriptome (tens of millions of reads), so it’s kind of a niche problem here.