Extending the metaphlan4 database (Jan2025) to include subspecies

I’m trying to extend the current version (Jan2025) of the metaphlan so that the tool will estimate the abundances of two Bifidobacterium longum subspecies (subsp. longum and subsp. infantis). I’m following some published methodology here with code available here. I have successfully added the marker genes to the bowtie indexed database and updated the pkl file. My understanding is that the two subspecies are successfully added to the taxonomy and the marker genes are added to the markers with clades t__subsp.infantis and t__subsp.longum. Then the original B. longum clade/SGB (t__SGB17248) is then removed and the marker genes are assigned to a new clade (s__Bifidobacterium_longum).

However, when I run metaphlan4 the mapping is successful but I encounter an error with full traceback:

Traceback (most recent call last):
File "/opt/conda/bin/metaphlan", line 10, in <module>
sys.exit(main())
~~~~^^
File "/opt/conda/lib/python3.13/site-packages/metaphlan/metaphlan.py", line 2318, in main
metaphlan_runner.run_metaphlan()
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^
File "/opt/conda/lib/python3.13/site-packages/metaphlan/metaphlan.py", line 1919, in run_metaphlan
self.metaphlan_analysis.report_results(self.tree, self.total_metagenome, self.avg_read_length)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/conda/lib/python3.13/site-packages/metaphlan/metaphlan.py", line 1385, in report_results
self.get_mapped_fraction()
~~~~~~~~~~~~~~~~~~~~~~~~^^
File "/opt/conda/lib/python3.13/site-packages/metaphlan/metaphlan.py", line 1267, in get_mapped_fraction
self.mapped = self.tree.relative_abundances()
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^
File "/opt/conda/lib/python3.13/site-packages/metaphlan/metaphlan.py", line 309, in relative_abundances
clade.compute_coverage()
~~~~~~~~~~~~~~~~~~~~~~^^
File "/opt/conda/lib/python3.13/site-packages/metaphlan/metaphlan.py", line 179, in compute_coverage
self.coverage = sum([child.compute_coverage() for child in self.children.values()])
~~~~~~~~~~~~~~~~~~~~~~^^
File "/opt/conda/lib/python3.13/site-packages/metaphlan/metaphlan.py", line 179, in compute_coverage
self.coverage = sum([child.compute_coverage() for child in self.children.values()])
~~~~~~~~~~~~~~~~~~~~~~^^
File "/opt/conda/lib/python3.13/site-packages/metaphlan/metaphlan.py", line 179, in compute_coverage
self.coverage = sum([child.compute_coverage() for child in self.children.values()])
~~~~~~~~~~~~~~~~~~~~~~^^
[Previous line repeated 4 more times]
File "/opt/conda/lib/python3.13/site-packages/metaphlan/metaphlan.py", line 184, in compute_coverage
rat_nreads = self.filter_markers_for_ext()
File "/opt/conda/lib/python3.13/site-packages/metaphlan/metaphlan.py", line 142, in filter_markers_for_ext
ext_markers2nreads = self.taxa2clades[ext].markers2nreads
~~~~~~~~~~~~~~~~^^^^^
KeyError: 'SGB17248'

Obviously, at some point the clade (SGB17248) isn’t completely removed/reassigned from the database but I am struggling to find it (I can see the SGB ID is appended as part of the marker gene IDs but I have no idea whether this is used for anything). Is there anything in the error that will point to what I have missed in re-building the database?

Hi @ryanbb

I am guessing the reason you encounter this issue is because within the database pickle file for each marker you have a list of external SGBs, which are SGBs different from the one that the marker is representing but that in a small percentage of cases has occurrences of that same marker. This means that the marker specifically doesn’t get used by MetaPhlAn when any of the external SGBs are present in the sample, to avoid false positive detections. What likely happened is that you are detecting an SGB which has SGB17248 as an external SGBs.

In practical terms what this means is that you will need to remove the occurrence of SGB17248 from the list of external markers of all SGBs. To be safe you could add the two subclades instead.

It should be something like:


mpa_example=pkl.load(bz2.open(‘mpa_vJan25_CHOCOPhlAnSGB_202503.pkl’,‘r’))
for m in mpa_example[‘markers’]:
    if ‘SGB17248’ in mpa_example[‘markers’][m][‘ext’]:
        mpa_example[‘markers’][m][‘ext’].remove(‘SGB17248’)
        mpa_example[‘markers’][m][‘ext’].append(‘subsp.longum’)
        mpa_example[‘markers’][m][‘ext’].append(‘subsp.infantis’)
1 Like

Thanks @Claudia_Mengoni this was indeed the issue. There were 4 other markers with external links to the SGB of interest. I can complete runs of metaphlan now.

1 Like