Humann2 computation speed up

hanh1985 · March 20, 2020, 5:27pm

Greetings,

We have a few large metagenomics datasets, up to ~20 GB each. A dataset of that size takes more than 3 days to run on an AWS cloud instance (34 threads, 70 GB of memory used). Is there any way to speed up the computation? For example, could we split the input data into many small files and distribute them across multiple nodes?

Regards,

-Han

franzosa · April 1, 2020, 3:44pm

Sorry for the delay. That does feel very slow - can you clarify what options you’re using for the run? The simplest speedup in HUMAnN 2.0 (that doesn’t alter the final results) is to use more threads, which will speed up the various mapping steps (esp. translated search). Another option is --bypass-translated-search, but this is only advisable if you’re working with a very well-studied community (with lots of reference genomes / pangenome coverage).
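As a rough illustration of those two options, here is a minimal Python sketch that assembles a humann2 command line. The helper name `build_humann2_command` and the file names are hypothetical; only the `--input`, `--output`, `--threads` and `--bypass-translated-search` flags come from HUMAnN 2.0 itself. Defaulting to every core the machine reports is an assumption, not a recommendation.

```python
import os
import shlex


def build_humann2_command(input_path, output_dir, threads=None,
                          bypass_translated_search=False):
    """Return a humann2 argument list (hypothetical helper, not part of HUMAnN)."""
    if threads is None:
        # Assumption: use every core the machine reports.
        threads = os.cpu_count() or 1
    cmd = [
        "humann2",
        "--input", input_path,
        "--output", output_dir,
        "--threads", str(threads),
    ]
    if bypass_translated_search:
        # Only sensible for very well-studied communities (see above).
        cmd.append("--bypass-translated-search")
    return cmd


if __name__ == "__main__":
    # Placeholder file names; pass the list to subprocess.run() to execute it.
    cmd = build_humann2_command("sample.fastq", "humann2_out", threads=34)
    print(" ".join(shlex.quote(part) for part in cmd))
```

Building the command as a list rather than a single string keeps it safe to hand to `subprocess.run` without shell quoting issues.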

I don’t think splitting up the input file would be any more effective than working with multiple threads, and there are some steps that wouldn’t translate well with this approach (e.g. coverage filtering, which benefits from “seeing” all the sample reads in the same run).

hanh1985 · April 1, 2020, 4:21pm

Hi Eric,

Thanks for the suggestion! It turns out the slow speed was caused by a weird Docker problem: the humann2 database was wrapped into a Docker image with multiple layers. We changed the way we manage Docker and the database, and the speed problem was solved.

We frequently process hundreds of metagenomics samples on the cloud, and we want to scale up and down whenever possible. I noticed that during humann2 processing, CPU utilization becomes very low for a while and then spikes. This suggests that certain steps of humann2 could be run on different cloud instances. Do you have any suggestions for which modules (e.g. single-threaded steps) we should look at?

Thanks,

-Han

franzosa · April 1, 2020, 4:28pm

Looking at the timestamps from a HUMAnN log file, these are the steps; the starred ones are parallelized (while the others are not):

prescreen*
custom database creation
database index
nucleotide alignment*
nucleotide alignment post-processing
translated alignment*
translated alignment post-processing
computing gene families
computing pathways*

Of these, translated alignment is by far the bottleneck in the process, followed by translated alignment post-processing.
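To check this on your own runs, a small script can pull the per-step timings out of the log and rank them. The sketch below assumes the log contains lines of the form `TIMESTAMP: Completed <step> : <N> seconds`; that format is an assumption, so adjust the regular expression if your log differs. The function names are hypothetical, and the sample lines and numbers are made up for illustration.

```python
import re

# Assumed line format: "... TIMESTAMP: Completed <step> : <N> seconds"
TIMESTAMP_RE = re.compile(
    r"TIMESTAMP:\s*Completed\s+(?P<step>.+?)\s*:\s*"
    r"(?P<seconds>\d+(?:\.\d+)?)\s*seconds"
)


def step_durations(log_lines):
    """Sum the reported seconds per step name (hypothetical helper)."""
    durations = {}
    for line in log_lines:
        match = TIMESTAMP_RE.search(line)
        if match:
            step = match.group("step").strip()
            seconds = float(match.group("seconds"))
            durations[step] = durations.get(step, 0.0) + seconds
    return durations


def rank_steps(durations):
    """Return (step, seconds) pairs, slowest step first."""
    return sorted(durations.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    # Made-up sample lines, for illustration only.
    sample = [
        "INFO: TIMESTAMP: Completed \tprescreen \t:\t 120\t seconds",
        "INFO: TIMESTAMP: Completed \tnucleotide alignment \t:\t 900\t seconds",
        "INFO: TIMESTAMP: Completed \ttranslated alignment \t:\t 5000\t seconds",
    ]
    for step, seconds in rank_steps(step_durations(sample)):
        print(f"{step}: {seconds:.0f} s")
```

Steps that are both slow and not parallelized are the ones where a large multi-core instance sits mostly idle, so they are the natural candidates to examine when sizing instances.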
