Custom MetaPhlAn subspecies taxonomy + corresponding HUMAnN ChocoPhlAn pangenome design

Hi bioBakery team,

I am working on a project where I want to achieve subspecies-level resolution in HUMAnN, specifically within Bifidobacterium longum (subsp. infantis and subsp. longum).

I have already modified the MetaPhlAn marker database to distinguish these subspecies successfully at the taxonomic profiling level.

I am running HUMAnN with a custom MetaPhlAn database and passing MetaPhlAn options through HUMAnN. However, I am encountering an error related to missing abundance/coverage fields in the MetaPhlAn taxonomic profile.

humann -i sample1_interleaved.fastq.gz -o output/ --memory-use maximum --remove-temp-output --metaphlan-options "-x mpa_vOct22_CHOCOPhlAnSGB_202403 --bowtie2db customdb/ -t rel_ab_w_read_stats"

The error I receive is:

ERROR: The relative abundance and coverage were not found in the MetaPhlAn taxonomic profile.
Please run MetaPhlAn with the option(s): -t rel_ab_w_read_stats.

Beyond this error, I would also like to understand the correct, supported way to keep HUMAnN functional profiling consistent with this custom taxonomy.

My questions are:

  1. HUMAnN documentation states that ChocoPhlAn is a pangenome database used for nucleotide-level mapping.
    If I introduce new taxa (e.g., subspecies) in MetaPhlAn, what is the correct way to extend or rebuild ChocoPhlAn accordingly?

  2. Is there a recommended pipeline to:

    • define subspecies-level pangenomes

    • generate corresponding gene families

    • ensure compatibility with HUMAnN naming conventions?

  3. Does HUMAnN strictly require ChocoPhlAn entries to match MetaPhlAn taxonomic labels one-to-one, or can multiple pangenomes map to a single species-level entry?

  4. Are there any internal tools or workflows used by the bioBakery team to regenerate ChocoPhlAn when taxonomy is modified?

@franzosa, could you kindly answer this query?

You would have to build a pangenome in HUMAnN’s format that would match each of your recognized subspecies. You could probably do this by simply subdividing the species’ total pangenome into overlapping subsets representing the different subspecies, but this isn’t something we have any official support for.
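To make the "overlapping subsets" idea concrete, here is a minimal sketch of the set logic only. It assumes you already have a table saying which reference genomes carry each gene in the species pangenome, and a subspecies label for each genome. The function name `split_pangenome`, both input dictionaries, and all identifiers are hypothetical and made up for illustration. Writing the resulting subsets out in HUMAnN's pangenome format is not covered here.

```python
from collections import defaultdict


def split_pangenome(gene_to_genomes, genome_to_subspecies):
    """Assign each gene to every subspecies in which at least one genome carries it.

    Genes found in genomes of several subspecies land in several subsets,
    so the subsets overlap on the shared (core) genes.
    """
    subsets = defaultdict(set)
    for gene, genomes in gene_to_genomes.items():
        for genome in genomes:
            subspecies = genome_to_subspecies.get(genome)
            if subspecies is not None:  # skip genomes without a subspecies label
                subsets[subspecies].add(gene)
    return dict(subsets)


# Toy, made-up identifiers (not real ChocoPhlAn or UniRef names).
gene_to_genomes = {
    "geneA": {"g1", "g2", "g3"},  # shared by both subspecies
    "geneB": {"g1"},              # only seen in an infantis genome
    "geneC": {"g3"},              # only seen in a longum genome
}
genome_to_subspecies = {"g1": "infantis", "g2": "infantis", "g3": "longum"}

subsets = split_pangenome(gene_to_genomes, genome_to_subspecies)
for name in sorted(subsets):
    print(name, sorted(subsets[name]))
```

Here `geneA` ends up in both subsets, which is the overlap described above. Any mapping step that assigns reads to one subspecies or the other would have to decide how to handle such shared genes.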

If it were me I’d simply profile against the full species pangenome normally and then look at the subsets that light up in relation to the subspecies you believe are present. Usually when there is strong subspecies-level structure you can see it very clearly in the presence/absence patterns of genes within the species pangenome across samples (a big universal gene block corresponding to the species’ core genome + separate blocks that are found in some subspecies but not others).
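The presence/absence inspection described above could be sketched roughly as follows. This assumes you have already pulled the B. longum gene-family abundances out of the HUMAnN output into a dictionary, and that each sample has a subspecies call (for example, from the custom MetaPhlAn profile). The function `classify_genes`, its thresholds, and the toy data are all hypothetical choices for illustration, not part of HUMAnN.

```python
def classify_genes(abundance, sample_group, min_abundance=0.0,
                   prevalence=0.8, absence=0.2):
    """Label each gene as core, subspecies-specific, or variable.

    abundance:    {gene: {sample: abundance}}
    sample_group: {sample: subspecies label}
    A gene is "present" in a sample when its abundance exceeds min_abundance.
    """
    groups = sorted(set(sample_group.values()))
    labels = {}
    for gene, per_sample in abundance.items():
        # Fraction of samples in each group where the gene is present.
        prev = {}
        for group in groups:
            members = [s for s, g in sample_group.items() if g == group]
            present = sum(1 for s in members
                          if per_sample.get(s, 0.0) > min_abundance)
            prev[group] = present / len(members)
        high = [g for g in groups if prev[g] >= prevalence]
        low = [g for g in groups if prev[g] <= absence]
        if len(high) == len(groups):
            labels[gene] = "core"
        elif len(high) == 1 and len(low) == len(groups) - 1:
            labels[gene] = high[0] + "-specific"
        else:
            labels[gene] = "variable"
    return labels


# Toy data: two samples dominated by each subspecies.
sample_group = {"s1": "infantis", "s2": "infantis",
                "s3": "longum", "s4": "longum"}
abundance = {
    "geneCore":  {"s1": 5.0, "s2": 4.0, "s3": 6.0, "s4": 3.0},
    "geneInf":   {"s1": 2.0, "s2": 1.5, "s3": 0.0, "s4": 0.0},
    "geneMixed": {"s1": 1.0, "s2": 0.0, "s3": 2.0, "s4": 0.0},
}
labels = classify_genes(abundance, sample_group)
print(labels)
```

The prevalence cutoffs are arbitrary. With real data, a clustered heatmap of the same presence/absence matrix is often the quickest way to see the core block and the subspecies-specific blocks.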