Performance of HUMAnN Custom Database Creation

Hi,
I am currently writing my bachelor thesis and am using HUMAnN3 for it. I noticed that the custom database creation step is quite slow for what it seems to be doing, so I decided to profile that step.
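For anyone who wants to reproduce this kind of measurement, a minimal sketch with Python's built-in cProfile module could look like the following. The function stand_in_workload is a hypothetical placeholder and not HUMAnN code; in practice the call to profile would be whatever step builds the custom database.

```python
import cProfile
import io
import pstats
import re


def stand_in_workload():
    """Hypothetical placeholder for the step being profiled (not HUMAnN code)."""
    hits = 0
    for i in range(2000):
        # Build a fresh pattern string on every iteration, as a naive loop might.
        if re.search("species_%d\\." % (i % 50), "g__genus.s__species_7.ffn.gz"):
            hits += 1
    return hits


profiler = cProfile.Profile()
profiler.enable()
stand_in_workload()
profiler.disable()

# Collect the report in a string, sorted by cumulative time, top 10 entries.
stream = io.StringIO()
pstats.Stats(profiler, stream=stream).sort_stats("cumulative").print_stats(10)
report = stream.getvalue()
print(report)
```

Sorting by cumulative time makes it easy to see which calls (for example functions from the re module) dominate the runtime.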

What I noticed is that the custom database creation seems to spend most of its time evaluating regular expressions! This part of the code does not seem to have changed significantly in the master branch of the GitHub repository.
As a bioinformatician, this seems like a good first issue for me and I’d like to try to fix it.
Is there any guidance on how I should approach this? Do you prefer tiny pull requests with small fixes, or rather a cleaned-up, more polished approach? Are there tests in place for the custom database creation step?

Kind regards,
Chris

If I change these lines in the prescreen file, I can achieve a speedup of the database creation of about two orders of magnitude on my machine.

What I did is the following:

    # identify the files to be used from the ChocoPhlAn database
    species_file_list = []
    if not config.bypass_prescreen:
        for species_file in os.listdir(chocophlan_dir):
            for species in species_found:
                # match the exact genus and species from the MetaPhlAn (or custom) list
                new_database_file = os.path.join(chocophlan_dir, species_file)
                # changed line: plain substring test instead of a regular expression
                if species.lower() + "." in species_file.lower() or species.lower() + "_group." in species_file.lower():
                    species_file_list.append(new_database_file)
                    logger.debug("Adding file to database: " + species_file)
                    break
    else:
        for species_file in os.listdir(chocophlan_dir):
            species_file_list.append(os.path.join(chocophlan_dir, species_file))
            logger.debug("Adding file to database: " + species_file)

It uses the substring test (the in operator) instead of repeatedly compiling a regex pattern for each species.
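The substring check above can also be pulled into a small standalone function, which makes it easier to test in isolation. This is only a sketch: the function name select_species_files and the example file and species names are made up for illustration and are not taken from HUMAnN or ChocoPhlAn. It also lowercases each species name once up front instead of inside the inner loop.

```python
def select_species_files(file_names, species_found):
    """Return the file names containing a species name followed by '.' or '_group.'."""
    # Lowercase every species once and precompute both accepted suffix variants.
    needles = []
    for species in species_found:
        lowered_species = species.lower()
        needles.append(lowered_species + ".")
        needles.append(lowered_species + "_group.")

    selected = []
    for file_name in file_names:
        lowered_name = file_name.lower()
        # any() stops at the first matching needle, like the break in the loop above.
        if any(needle in lowered_name for needle in needles):
            selected.append(file_name)
    return selected


# Hypothetical example names, for illustration only.
files = [
    "g__Bacteroides.s__Bacteroides_dorei.centroids.ffn.gz",
    "g__Bacteroides.s__Bacteroides_vulgatus.centroids.ffn.gz",
    "g__Escherichia.s__Escherichia_coli_group.centroids.ffn.gz",
]
species = ["s__Bacteroides_dorei", "s__Escherichia_coli"]
selected = select_species_files(files, species)
print(selected)
```

With these inputs, the dorei file and the coli_group file are selected, while the vulgatus file is skipped because no species in the list matches it.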

I only noticed this with really large samples that have several hundred species in the prescreening step; with fewer species the performance hit is not as noticeable.

With these modifications, a database creation that took 2436 seconds before now takes 28 seconds.
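A comparison like this can be approximated outside of HUMAnN with a small synthetic benchmark. The sketch below is an assumption-laden illustration: the regex variant is a stand-in written for this example (it is not the original HUMAnN code), and the species and file names are generated placeholders. It first checks that both variants select the same files and then times each of them once with timeit. The absolute numbers will differ from the real database creation step and depend on the machine and Python version.

```python
import re
import timeit


def match_with_regex(file_names, species_found):
    """Stand-in regex variant (assumed for illustration, not HUMAnN's code)."""
    selected = []
    for file_name in file_names:
        for species in species_found:
            # Species name, optional "_group", then a literal dot.
            pattern = re.escape(species.lower()) + r"(_group)?\."
            if re.search(pattern, file_name.lower()):
                selected.append(file_name)
                break
    return selected


def match_with_substring(file_names, species_found):
    """Substring variant, mirroring the modified loop shown above."""
    selected = []
    for file_name in file_names:
        for species in species_found:
            lowered = file_name.lower()
            if species.lower() + "." in lowered or species.lower() + "_group." in lowered:
                selected.append(file_name)
                break
    return selected


# Synthetic placeholder data: 200 species, 500 files, the first 200 files match.
species_found = ["s__species_%04d" % i for i in range(200)]
file_names = ["g__genus.s__species_%04d.centroids.ffn.gz" % i for i in range(500)]

regex_result = match_with_regex(file_names, species_found)
substring_result = match_with_substring(file_names, species_found)
print("same selection:", regex_result == substring_result)

regex_seconds = timeit.timeit(
    lambda: match_with_regex(file_names, species_found), number=1
)
substring_seconds = timeit.timeit(
    lambda: match_with_substring(file_names, species_found), number=1
)
print("regex: %.3f s, substring: %.3f s" % (regex_seconds, substring_seconds))
```

Checking that both variants return the same selection before timing them guards against a speedup that only comes from matching different files.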