Dereplicating and normalizing reads for HUMAnN and MetaPhlAn

Hello,

I have a few hundred deeply sequenced gut metagenome samples that I want to run HUMAnN 3 and MetaPhlAn 4 on. A single sample takes about 1.5 TB of memory and 4-5 days to finish, and I don't have enough resources to complete this many samples in a short time.

So, I wanted to ask: is it appropriate to dereplicate and/or normalize the sample reads with tools like BBTools before running MetaPhlAn and HUMAnN? Dereplication and/or normalization greatly reduces the file size, making it easier to run HUMAnN/MetaPhlAn on all these samples.

If this is not appropriate, then please suggest other ways to successfully run HUMAnN in particular on these deeply sequenced samples.

Please advise whether it is appropriate to do that.

Many thanks

If I am understanding your question correctly, it is common (though still controversial) to subsample reads to the same depth, often the lowest depth among the samples you would like to compare. For example, if you have three samples of 10, 20, and 100 million reads, you could subsample all three to 10 million reads. This can be done using BBTools `reformat.sh` by specifying `samplereadstarget`, which should take no more than 3-5 minutes per sample. If the lowest-depth sample is still too deep, just pick a lower `samplereadstarget` that you think is reasonable, but be wary of under-sampling.
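To make the idea concrete, here is a toy Python sketch of what uniform subsampling to a fixed read target does (in practice you would use `reformat.sh` with `samplereadstarget` directly; the sample names and read counts here are made up):

```python
import random

def subsample_reads(reads, target, seed=42):
    """Uniformly subsample a list of reads down to `target` reads.

    Samples that already have <= target reads are returned unchanged,
    which mirrors how a fixed read target behaves on shallow samples.
    """
    if len(reads) <= target:
        return list(reads)
    rng = random.Random(seed)  # fixed seed -> reproducible subsample
    return rng.sample(reads, target)

# Three mock samples of 10, 20, and 100 "reads" (stand-ins for millions)
samples = {
    "A": [f"A_read{i}" for i in range(10)],
    "B": [f"B_read{i}" for i in range(20)],
    "C": [f"C_read{i}" for i in range(100)],
}
target = 10  # the depth of the shallowest sample
subsampled = {name: subsample_reads(r, target) for name, r in samples.items()}
for name, r in subsampled.items():
    print(name, len(r))  # every sample ends up at 10 reads
```

The fixed seed is just for reproducibility of the sketch; the point is that all samples end up at the same depth.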

However, reducing reads may not reduce the resource demand as much as you might expect, because the databases themselves are large.

By the way, dereplicating reads may mask real signal if the same region of DNA is genuinely sequenced twice (possible when deeply sequencing some systems); note that this is not the same as duplicates in the RNA-Seq world.

This is slightly off-topic, but may I ask why you opted for deep sequencing if computational resource limitations force you to subsample?

1.5 TB of memory for HUMAnN would be unprecedented. What sort of sequencing depth are we talking about here?

There are about 1.2-1.5 billion reads in each sample. We sequenced the samples this deep to identify any novel species. I was thinking of dereplicating the samples with `clumpify.sh` and setting `dedupe=t`, since removing exact duplicates would make it easier to run HUMAnN.
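For what it's worth, the exact-duplicate removal that `clumpify.sh dedupe=t` performs is conceptually just "keep one representative of each identical sequence". A minimal Python sketch with toy sequences (not real reads):

```python
def dedupe_exact(reads):
    """Keep the first occurrence of each exact sequence (order preserved).

    This mirrors exact-duplicate removal: identical nucleotide strings
    collapse to one representative, regardless of whether the duplicate
    was a technical artifact or an independently sequenced fragment.
    """
    seen = set()
    kept = []
    for seq in reads:
        if seq not in seen:
            seen.add(seq)
            kept.append(seq)
    return kept

reads = ["ACGT", "TTAA", "ACGT", "GGCC", "TTAA", "ACGT"]
deduped = dedupe_exact(reads)
print(deduped)                     # ['ACGT', 'TTAA', 'GGCC']
print(len(reads), "->", len(deduped))  # 6 -> 3
```

Note that the abundance information (three copies of `ACGT`, two of `TTAA`) is discarded, which is exactly the concern raised elsewhere in this thread about biological duplicates at high depth.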

MetaPhlAn works fine on these samples; it is just HUMAnN that gets stuck.


That’s the crucial question. With 1.5 billion reads per sample (as mars wrote below), you almost inevitably run into a resource limit with HUMAnN, especially during the translated search (DIAMOND) step.

I would agree with the advice on subsampling. Functional profiling (gene families and pathways) often reaches saturation much sooner than taxonomic discovery (novel species). It’s common practice to subsample HUMAnN to, for example, 20-50 million reads to save runtime and use the full depth only for assembly/binning.

Good luck with your analysis!

Could you clarify what you mean by “identify novel species”?
HUMAnN requires a predefined reference against which reads are queried. Therefore, in classic microbiology terms, it is simply not possible to identify a “novel species”, i.e. a bacterium that has never before been identified and confirmed to be distinct from other species of the same genus. If a bacterium is detected by HUMAnN, it simply cannot be a novel species.
However, if by “novel” you mean a species that has not previously been observed in your sample type, cohort, etc., but is already known, that seems more appropriate. Purely hypothetically, as I don’t know the system: if you find E. coli in earwax for the first time, that species is “novel” to earwax, and theoretically HUMAnN would support that hypothesis/claim.
Either way, you will need to be careful with a simple one-off subsampling approach. If the member you are looking for is low abundance, it might not be detected in one subsample but could be in another. Consider a pizza: the toppings are not perfectly evenly distributed, and some slices will have more and others less.

I think it’s wise to note that BBTools dedupe is based purely on nucleotide identity, so one read is picked out of any number of identical sequences, in contrast to removing specifically technical-artifact duplicates (as in RNA-Seq). It is possible to sequence the exact same region twice, and this likelihood increases with greater depth, along with the greater chance to sequence any region at least once. Therefore, if all but one of a pool of identical sequences are discarded, MetaPhlAn and HUMAnN counts and coverage calculations will be inaccurate.

I am not sure I understand how functions saturate before taxa.

Assuming a species has a fixed number of functions, the two would saturate at the same rate. However, we know that any given microbial species’ genome is a single, sometimes consensus, representative of its pangenome, which has an uncertain number of total genes, so it seems theoretically impossible for functions to saturate first. For example, in HUMAnN the E. coli taxon has ~90K UniRef IDs attached to it, whereas any given E. coli genome has ~4.5K functions.
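One way to look at the ~90K vs ~4.5K tension numerically: under a deliberately naive model where each gene-mapping read hits one of F distinct gene families uniformly at random, the expected fraction of families observed after n reads is 1 - (1 - 1/F)^n. Plugging in the figures from above as pool sizes (a big simplification; real abundances are highly skewed):

```python
def expected_fraction_observed(pool_size, n_reads):
    """Expected fraction of distinct gene families seen after n reads,
    assuming each read hits one family uniformly at random (naive model)."""
    return 1.0 - (1.0 - 1.0 / pool_size) ** n_reads

single_genome = 4_500   # ~functions in one E. coli genome
pangenome = 90_000      # ~UniRef IDs attached to the E. coli taxon in HUMAnN

for n in (10_000, 100_000, 1_000_000):
    f_one = expected_fraction_observed(single_genome, n)
    f_pan = expected_fraction_observed(pangenome, n)
    print(f"{n:>9} reads: single genome {f_one:.3f}, pangenome {f_pan:.3f}")
```

Under this toy model, the function set of a single genome saturates long before the full pangenome pool does, while at extreme depth even the pangenome fills in, which is consistent with both sides of this discussion: whether “functions saturate first” depends on whether you mean one representative genome’s functions or the whole pangenome catalog.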

There are known examples of genome reduction that contract the pangenome because genes are lost, which makes the definition/interpretation of a pangenome somewhat confusing; arguably the pangenome can still grow through mutations and other mechanisms.