Getting humann and metaphlan to integrate

One of the most common issues I am seeing across this forum is compatibility between humann and metaphlan versions, their databases, and/or how to properly set them up and integrate them. Since metaphlan does not require humann, I see it as more of a humann issue.

I want to document how complicated humann and metaphlan integration is, and to demonstrate it now that I have (mostly, see bottom) gotten it working. This is partly a record of my efforts for supervisors and partly for knowledge transfer to, and reproduction by, peers and colleagues. I am not sure how most people are figuring this out, but I’ve been installing, testing, and managing bioinformatic software for years and this was rather complicated, even though the eventual solution was fairly simple.

I think one major aspect of the difficulty in integrating these tools is documentation that is overly simplified, ambiguous, or not updated. I understand that this is time-consuming on the developer side, given that there are two tools with their own databases that require a unidirectional interaction. On the user side, it gets extremely convoluted to follow available documentation that does not yield a functional package (below), and then scour the --help output and issues for solutions that may not be feasible or may not work as expected (possibly without the user realizing it). Sometimes humann3 can run with its native databases, but sometimes it can’t and must have a metaphlan3/4 database (depending on the versions of each, see issue 5838 for some confusion), and it is often unclear which until it is attempted. Database versions are variously referred to as “2019”, “Jun23”, “June 2023”, “3”, “full”, “bioBakery 3.1 pangenomes”, and likely more across the available documentation and forum. I think the closest thing to official documentation for humann v3.5+ regarding database compatibility is the v3.8 release note, where it says “Oct22 SGB marker database”, but it would still benefit from a more explicit “for humann v3.8, use metaphlan v4.1 and its mpa_vOct22_CHOCOPhlAnSGB_202212 database (metaphlan --install --index mpa_vOct22_CHOCOPhlAnSGB_202212 --bowtie2db /your/path)”. Yes, I understand that I can go to metaphlan’s documentation and get this information, so link to it please!
Even then, none of the available documentation (GitHub readme, GitHub wiki, humann landing page) gives an example of how to use a metaphlan (v4) database while running humann (v3.5+), even though this is at least sometimes necessary (depending on humann version, and unless metaphlan is run independently). That necessity is also not apparent from the documentation. It has been shown in theory in two issues (in no particular order): (1) issue 6875, which was bumped to metaphlan, though my guess is that the problem is a conflict from specifying both a metaphlan index and a nucleotide database; and (2) issue 5838, which had a database/version conflict, but which I think also unnecessarily uses both --taxonomic-profile and --metaphlan-options '--index XY', where the second option could cause issues and may also conflict when both a metaphlan index and a humann nucleotide database are passed.

Another aspect is that humann, in my experience, is simply not “plug and play” in the sense that a user can install the tool, download the databases using the appropriate utilities, and run it successfully. I note that the biobakery docker hub profile includes up to humann v3.9 but only metaphlan v4.0, so the “latest” images are not compatible and humann v3.5+ completely lacks a compatible image. This week, conda installs of humann v3.5-3.9 are by default packaged with metaphlan v4.2, with humann v3.9 being completely non-functional (“humann” is tab-completable but execution returns ModuleNotFoundError: No module named 'humann'). In Feb of this year (2025), my installs gave humann v3.9 with metaphlan v3.0 and humann v4.0 with metaphlan v2.0. None of these yields a compatible combination of humann and metaphlan (according to release notes and/or testing). The relevant nucleotide database that humann installs (for v3.5+, the full chocophlan from 2019) cannot be used by compatible versions of metaphlan (v4+ databases are several more recent releases that include SGBs, see the snippet above), and the databases that compatible versions of metaphlan use cannot be used by humann. These databases are completely different in structure (humann v3.5+ databases are a directory with each marker as a compressed nucleotide file, metaphlan v4.0+ databases are a bowtie2 index), so direct cross-compatibility should not be expected, but a user might only realize this by looking at them. In a separate thread, I discovered that humann (v3.8) does not interpret metaphlan (v4.1) results correctly when a default parameter (analysis type, -t) is changed to a suggested, more data-rich setting (see issue 8355): the run completes, but the behavior is unexpected and the results are incomplete.
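For anyone who wants to look before guessing, here is a minimal sketch of a check that tells the two database layouts apart by file name alone. The function names has_files and classify_db_dir are hypothetical (not part of humann or metaphlan), and the file patterns are assumptions based on the layouts described above: *.ffn.gz files for the humann chocophlan directory and *.bt2l or *.bt2 files for a metaphlan bowtie2 index.

```shell
# Hypothetical helpers, not part of humann or metaphlan.

# Succeeds if directory $1 holds at least one entry matching pattern $2.
has_files() {
    [ -n "$(find "$1" -maxdepth 1 -name "$2" 2>/dev/null | head -n 1)" ]
}

# Prints a guess at the database layout found in directory $1.
classify_db_dir() {
    if [ ! -d "$1" ]; then
        echo "missing"
    elif has_files "$1" '*.ffn.gz'; then
        echo "humann-chocophlan"      # directory of per-taxon nucleotide files
    elif has_files "$1" '*.bt2l' || has_files "$1" '*.bt2'; then
        echo "metaphlan-bowtie2"      # bowtie2 index files
    else
        echo "unknown"
    fi
}

# Example calls (placeholder paths from this post):
# classify_db_dir /path/to/humanndatabases/full_chocophlan.v201901_v31
# classify_db_dir /path/to/metaphlandatabase
```

If the answer is "metaphlan-bowtie2", the directory is one to hand to metaphlan rather than to humann's --nucleotide-database.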

Maybe most or all of this will be solved with humann v4.0+ and metaphlan v4.2, but until then…

Below, I show my process to create and use a successful installation of metaphlan v4.1 and humann v3.8, along with some of the issues that I describe above and a solution to get metaphlan and humann (mostly) integrated.

Here, an install of the most recent, (mostly) functional, and compatible metaphlan and humann versions and databases. Note, it is somewhat simplified for ease. I think the workflow is what matters as none of the code is particularly complex or unexpected.

python --version
Python 3.13.5
conda --version
conda 24.1.2
conda create -n humann3.8 -y
conda activate humann3.8
conda install humann=3.8 metaphlan=4.1 -y # the default install (humann 3.9) is non-functional, and metaphlan must be forced down to 4.1 because 4.2 is incompatible.
humann_databases --download chocophlan full /path/to/humanndatabases --update-config yes
humann_databases --download uniref uniref90_diamond /path/to/humanndatabases --update-config yes
humann_databases --download utility_mapping full /path/to/humanndatabases --update-config yes
metaphlan --install --bowtie2db /path/to/metaphlandatabase --index mpa_vOct22_CHOCOPhlAnSGB_202403
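
Before running anything, it may be worth confirming what actually got installed, since which metaphlan conda pairs with humann has changed over time. A minimal sketch: check_tools is a hypothetical helper, while humann --version, metaphlan --version, and humann_config --print are existing commands (the last should show the database paths written by --update-config yes).

```shell
# Hypothetical helper: report which of the named commands are on PATH.
# Returns non-zero if any of them is missing.
check_tools() {
    missing=0
    for tool in "$@"; do
        if command -v "$tool" >/dev/null 2>&1; then
            echo "found: $tool"
        else
            echo "missing: $tool"
            missing=1
        fi
    done
    return "$missing"
}

# Only query versions and configuration if everything is present.
if check_tools humann metaphlan humann_config bowtie2 diamond; then
    humann --version
    metaphlan --version
    humann_config --print    # shows the configured database locations
fi
```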

Here, several failed attempts to use the installed databases directly, despite following available documentation and making what felt like intuitive changes to try to attain functionality.

humann --input reads.fq --output humannoutput
[simplified for brevity]
The MetaPhlAn taxonomic profile provided was not generated with the database version v3 or vOct22 . Please update your version of MetaPhlAn to at least v3.0 or if you are using MetaPhlAn v4 please use the database vOct22.
[humann ends now]

[This is expected for a fresh install of the database because the required bowtie index is, supposedly, generated during the first run. This conflicts with documentation suggesting that using the humann_databases utility with --update-config yes is sufficient to prepare the databases]

humann --input reads.fq --output humannoutput --nucleotide-database /path/to/humanndatabases/full_chocophlan.v201901_v31
[simplified for brevity]
ERROR: The MetaPhlAn taxonomic profile provided was not generated with the database version v3 or vOct22 . Please update your version of MetaPhlAn to at least v3.0 or if you are using MetaPhlAn v4 please use the database vOct22.
[humann ends now]

[I believe that this is a case where humann cannot use its native database and must use a metaphlan database]
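
The error above is about the database recorded in the metaphlan profile, so one way to see what humann is objecting to is to read that header directly. A minimal sketch, assuming the profile contains a comment line of the form #mpa_v... naming the database. profile_db_version is a hypothetical helper, and the path in the example comment is only my assumption of where humann leaves its intermediate profile, which may differ by version.

```shell
# Hypothetical helper: print the database name recorded in a metaphlan profile.
# Assumes the profile has a header line starting with "#mpa_".
profile_db_version() {
    # Print the first "#mpa_" line without the leading "#", then stop reading.
    sed -n '/^#mpa_/{s/^#//p;q;}' "$1"
}

# Example call (assumed location of humann's intermediate profile):
# profile_db_version humannoutput/reads_humann_temp/reads_metaphlan_bugs_list.tsv
```

If the printed name does not contain v3 or vOct22, that would match the complaint in the error message.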

humann --input reads.fq --output humannoutput --nucleotide-database /path/to/metaphlandatabase/mpa_vOct22_CHOCOPhlAnSGB_202403
[simplified for brevity]
CRITICAL ERROR: The directory provided for the ChocoPhlAn database at path/to/metaphlandatabase/mpa_vOct22_CHOCOPhlAnSGB_202403 does not exist. Please select another directory.
[humann ends now]

[expected, once a solution is discovered, but to me this is an intuitive approach]

Here, an example of how I was able to integrate a metaphlan (v4.1) database while running humann (v3.5+).

humann --input reads.fq --output humannoutput --metaphlan-options "--bowtie2db /path/to/metaphlandatabase --index mpa_vOct22_CHOCOPhlAnSGB_202403"
[simplified for brevity]
Found t__ABC : ##.##% of mapped reads
[humann run completes]
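
The other route mentioned earlier, running metaphlan independently and handing its profile to humann with --taxonomic-profile, can be sketched as follows. I have not shown this route working in this post, so treat it as a sketch under assumptions: the flag names are those of metaphlan v4.1 and humann v3.8, the function names run and profile_then_humann and the output file names are hypothetical, and the analysis type (-t) is deliberately left at its default because of the behavior described in issue 8355.

```shell
# Hypothetical wrapper: set DRY_RUN=1 to print commands instead of running them.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Usage: profile_then_humann READS OUTDIR METAPHLAN_DB_DIR INDEX
profile_then_humann() {
    reads="$1"; outdir="$2"; dbdir="$3"; index="$4"
    profile="$outdir/metaphlan_profile.tsv"    # hypothetical file name

    run mkdir -p "$outdir" &&
    # Step 1: taxonomic profile with the metaphlan v4.1 database.
    run metaphlan "$reads" --input_type fastq \
        --bowtie2db "$dbdir" --index "$index" \
        --bowtie2out "$outdir/metaphlan_bowtie2.txt" \
        -o "$profile" &&
    # Step 2: humann reuses that profile instead of calling metaphlan itself.
    run humann --input "$reads" --output "$outdir" \
        --taxonomic-profile "$profile"
}

# Example call (placeholder paths from this post):
# profile_then_humann reads.fq humannoutput /path/to/metaphlandatabase mpa_vOct22_CHOCOPhlAnSGB_202403
```

Running it once with DRY_RUN=1 prints the three commands so the paths can be checked before committing to a long run.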

I am sorry if this was a bit vent-ey. I appreciate that the work you all are doing is extremely challenging from start to end, or at least some semblance of an “end”.

By the way, should full_chocophlan.v201901_v31 contain alaS.centroids.v201901_v31.ffn.gz? It is the only file that is not named for its own taxon.

ls -l full_chocophlan.v201901_v31/ | head
total 16213296
-rw-rw-r--+ 1 secretuser 875929 Apr 11 2022 alaS.centroids.v201901_v31.ffn.gz
-rw-rw-r--+ 1 secretuser 998654 Apr 11 2022 g__Abditibacterium.s__Abditibacterium_utsteinense.centroids.v201901_v31.ffn.gz
-rw-rw-r--+ 1 secretuser 586789 Apr 11 2022 g__Abiotrophia.s__Abiotrophia_defectiva.centroids.v201901_v31.ffn.gz
-rw-rw-r--+ 1 secretuser 552223 Apr 11 2022 g__Abiotrophia.s__Abiotrophia_sp_HMSC24B09.centroids.v201901_v31.ffn.gz
-rw-rw-r--+ 1 secretuser 624455 Apr 11 2022 g__Absiella.s__Absiella_dolichum.centroids.v201901_v31.ffn.gz
-rw-rw-r--+ 1 secretuser 1080093 Apr 11 2022 g__Abyssibacter.s__Abyssibacter_profundi.centroids.v201901_v31.ffn.gz
-rw-rw-r--+ 1 secretuser 2044528 Apr 11 2022 g__Acaryochloris.s__Acaryochloris_marina.centroids.v201901_v31.ffn.gz
-rw-rw-r--+ 1 secretuser 1630158 Apr 11 2022 g__Acaryochloris.s__Acaryochloris_sp_RCC1774.centroids.v201901_v31.ffn.gz
-rw-rw-r--+ 1 secretuser 804195 Apr 11 2022 g__Acetanaerobacterium.s__Acetanaerobacterium_elongatum.centroids.v201901_v31.ffn.gz
[tail -n 1 is g__Zymomonas.s__Zymomonas_mobilis.centroids.v201901_v31.ffn.gz, so I presume there is nothing in between and thus alaS seems to be the only one of its kind]
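
Rather than presuming from head and tail, the directory can be searched for every file that does not follow the g__Genus.s__Species naming. A minimal sketch: list_non_taxon_files is a hypothetical helper, and the naming pattern is taken from the listing above.

```shell
# Hypothetical helper: list .ffn.gz files in directory $1 whose names do not
# follow the g__<genus>.s__<species> pattern seen in the listing.
list_non_taxon_files() {
    find "$1" -maxdepth 1 -name '*.ffn.gz' ! -name 'g__*.s__*' | sort
}

# Example call (placeholder path from this post):
# list_non_taxon_files /path/to/humanndatabases/full_chocophlan.v201901_v31
```

If alaS really is the only one of its kind, this prints exactly one line.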