Dereplicating and normalizing reads for humann and metaphlan

Hello,

I have a few hundred deeply sequenced gut metagenome samples that I want to run humann3 and metaphlan4 on. A single sample takes about 1.5 TB of memory and 4-5 days to finish, and I don't have enough resources to complete this many samples in a short time.

So, I wanted to ask whether it is fine to dereplicate and/or normalize sample reads with tools like bbtools before running metaphlan and humann. Performing dereplication and/or normalization greatly reduces the file sizes, making it easier to run humann/metaphlan on all of these samples.

If this is not appropriate, then please suggest any other ways to successfully run humann in particular on these deeply sequenced samples.

Please advise whether it is appropriate to do that.

Many thanks

If I am understanding your question correctly, it is common (though still controversial) to subsample reads to the same depth, often the depth of the shallowest sample you want to compare. For example, if you have three samples of 10, 20, and 100 million reads, you could subsample all three to 10 million reads. This can be done with BBTools reformat.sh by specifying samplereadstarget, and should take no more than 3-5 minutes per sample. If the lowest-depth sample is still too deep, pick a lower samplereadstarget that you think is reasonable, but be wary of under-sampling.
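As a rough sketch, a subsampling run with reformat.sh might look like the following (the file names are placeholders; samplereadstarget is the BBTools option mentioned above, which sets the number of output reads — check the reformat.sh help text for how it counts paired reads in your version):

```shell
# Subsample a paired-end sample to roughly 10 million reads
# using BBTools reformat.sh (input/output names are placeholders).
reformat.sh \
    in=sample_R1.fastq.gz in2=sample_R2.fastq.gz \
    out=sub_R1.fastq.gz out2=sub_R2.fastq.gz \
    samplereadstarget=10000000
```

Running this per sample before humann/metaphlan keeps all samples at a comparable depth, at the cost of discarding reads from the deeper samples.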

However, it is also possible that reducing the number of reads may not reduce the resource demands as much as you might expect, because the databases themselves are large.

By the way, dereplicating reads may mask real signal when the same region of DNA is genuinely sequenced twice (possible when deeply sequencing some systems); note that this may not be the same as duplicates in the RNA-Seq world.

This is slightly off-topic, but may I ask why you opted for deep sequencing if you need to subsample due to computational resource limitations?

1.5 TB of memory for HUMAnN would be unprecedented. What sort of sequencing depth are we talking about here?

There are about 1.2-1.5 billion reads in each sample. We sequenced the samples this deep to identify any novel species. I was thinking of dereplicating the samples with clumpify.sh and setting dedupe=true, since it removes exact duplicates, making it easier to run humann.
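For reference, the clumpify.sh deduplication described above might be invoked along these lines (file names are placeholders; dedupe=t removes exact duplicate reads — consult the clumpify.sh documentation for your BBTools version before relying on the exact flags):

```shell
# Dereplicate a paired-end sample with BBTools clumpify.sh;
# dedupe=t drops reads that are exact copies of another read.
clumpify.sh \
    in=sample_R1.fastq.gz in2=sample_R2.fastq.gz \
    out=dedup_R1.fastq.gz out2=dedup_R2.fastq.gz \
    dedupe=t
```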

Metaphlan works fine on these samples; it's just humann that gets stuck.