Performance of HUMAnN Custom Database Creation

ChristianRomberg · May 24, 2024, 9:41am

Hi,
I am currently writing my bachelor thesis and am using HUMAnN3 for it. I noticed that the custom database creation step is quite slow for what it seems to be doing, so I decided to profile that step.

What I noticed is that the custom database creation seems to spend most of its time evaluating regular expressions! It also seems that this step has not changed significantly in the master branch of the GitHub repository.
I am a bioinformatician, and this seems like a good first issue for me, so I'd like to try to fix it.
Is there any guidance on how I should approach this? Do you prefer tiny pull requests with small fixes, or a cleaned-up, more polished approach? Are there tests in place for the custom database creation step?

Kind regards,
Chris

ChristianRomberg · May 24, 2024, 11:45am

If I change these lines in the prescreen file, I can achieve a speedup of the database creation of about two orders of magnitude on my machine.

What I did is the following:

    # identify the files to be used from the ChocoPhlAn database
    species_file_list = []
    if not config.bypass_prescreen:
        for species_file in os.listdir(chocophlan_dir):
            for species in species_found:
                # match the exact genus and species from the MetaPhlAn (or custom) list
                new_database_file = os.path.join(chocophlan_dir, species_file)
                if species.lower() + "." in species_file.lower() or species.lower() + "_group." in species_file.lower():
                    species_file_list.append(new_database_file)
                    logger.debug("Adding file to database: " + species_file)
                    break
    else:
        for species_file in os.listdir(chocophlan_dir):
            species_file_list.append(os.path.join(chocophlan_dir, species_file))
            logger.debug("Adding file to database: " + species_file)

It uses the string containment operator (in) instead of compiling a regex pattern for each species multiple times.
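
To make the idea concrete outside of HUMAnN, here is a minimal, self-contained sketch of the same substring check. The function name select_species_files and the file names are made up for illustration; they are not taken from the HUMAnN source or from the ChocoPhlAn database. As a small extra, the sketch lowercases each species name only once, up front.

```python
def select_species_files(filenames, species_found):
    """Return the file names that match any species via plain substring tests."""
    # Lowercase each species once and build both accepted markers up front.
    markers = []
    for species in species_found:
        name = species.lower()
        markers.append(name + ".")
        markers.append(name + "_group.")

    selected = []
    for filename in filenames:
        lowered = filename.lower()
        # A file is kept as soon as one marker occurs anywhere in its name.
        if any(marker in lowered for marker in markers):
            selected.append(filename)
    return selected


# Hypothetical file names, only for demonstration.
files = [
    "g__Bacteroides.s__Bacteroides_dorei.centroids.ffn.gz",
    "g__Bacteroides.s__Bacteroides_dorei_2.centroids.ffn.gz",
    "g__Escherichia.s__Escherichia_coli_group.centroids.ffn.gz",
]
print(select_species_files(files, ["s__Bacteroides_dorei", "s__Escherichia_coli"]))
```

The trailing "." in each marker is what keeps a species name from matching a longer name that merely starts with it, which is why the second file above is not selected.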

I only noticed this when using really large samples with several hundred species in the prescreening step; with a lower number of species the performance hit is not as noticeable.

With these modifications, a database creation that took 2436 seconds before now takes 28 seconds.
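
Anyone who wants to check a difference like this on their own machine could use a small timing harness along the following lines. This is only a sketch with synthetic names: the regex variant is a stand-in for a per-species pattern match and is not copied from HUMAnN, and the timings it prints will not reproduce the numbers above.

```python
import re
import time


def match_with_regex(filenames, species_found):
    selected = []
    for filename in filenames:
        for species in species_found:
            # Stand-in regex check: species name followed by "." or "_group."
            pattern = re.escape(species.lower()) + r"(_group)?\."
            if re.search(pattern, filename.lower()):
                selected.append(filename)
                break
    return selected


def match_with_substring(filenames, species_found):
    selected = []
    for filename in filenames:
        lowered = filename.lower()
        for species in species_found:
            name = species.lower()
            if name + "." in lowered or name + "_group." in lowered:
                selected.append(filename)
                break
    return selected


# Synthetic data: 100 file names and 600 species (every second index).
species_found = ["s__Species_%04d" % i for i in range(0, 1200, 2)]
filenames = ["g__Genus.s__Species_%04d.centroids.ffn.gz" % i for i in range(100)]

for func in (match_with_regex, match_with_substring):
    start = time.perf_counter()
    result = func(filenames, species_found)
    elapsed = time.perf_counter() - start
    print(f"{func.__name__}: {len(result)} files in {elapsed:.3f} s")
```

Both functions should select the same files, so the harness doubles as a quick sanity check that the substring version does not change which files end up in the database.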

franzosa · June 20, 2024, 9:03pm

Thanks for your idea. I know under normal circumstances this step is not the bottleneck in the HUMAnN workflow, but it seems like it can become time-consuming with 100s of species. I will transfer this message to the feature/pull request topic so our developers can consider it.
