Metaphlan V4.0.2 and Huma 3.6: MetaPhlAn taxonomic profile provided was not generated with the expected database

Continuing the discussion from Metaphlan4 and the new version humann:

I am having difficulty with the new DBs, metaphlan 4, and human 3.6.

$ metaphlan --version
MetaPhlAn version 4.0.2 (22 Sep 2022)

$ humann --version
humann v3.6

I ran metaphlan successfully; here is part of the resulting file

head 1064K_IGO_11592_240_metaphlan3_profile.txt
#mpa_vJan21_CHOCOPhlAnSGB_202103
#/usr/local/bin/metaphlan kneaddata/1064K_IGO_11592_240_knead_cat.fastq.gz --bowtie2db /resources/biobakery_workflows_dbs/metaphlanV4 --index mpa_vJan21_CHOCOPhlAnSGB_202103 --input_type fastq --sample_id 1064K_IGO_11592_240 -s metaphlan/1064K_IGO_11592_240.sam.bz2 --add_viruses --unclassified_estimation --nproc 16 -t rel_ab_w_read_stats -o metaphlan/1064K_IGO_11592_240_metaphlan3_profile.txt
#45176646 reads processed
#SampleID       1064K_IGO_11592_240
#estimated_reads_mapped_to_known_clades:249400536
#clade_name     clade_taxid     relative_abundance      coverage        estimated_number_of_reads_from_the_clade
UNCLASSIFIED    -1      26.74675        -       0
k__Bacteria     2       73.2333 7.03189 249212104
k__Eukaryota    2759    0.01994 0.00191 188432
k__Bacteria|p__Firmicutes       2|1239  58.3959 5.60719 172487448

But when I try to run Humann, I get an error about needing results from metaphlan v3 or higher:

$ humann      \
   --input kneaddata/1064K_IGO_11592_240_knead_cat.fastq.gz          \
   --output humann     --output-basename 1064K_IGO_11592_240_humann3     \
   --o-log humann/1064K_IGO_11592_240_humann.log       \
    --search-mode uniref90     \
    --remove-column-description-output   \
   --protein-database /resources/biobakery_workflows_dbs/uniref90_201901b \
   --nucleotide-database /resources/biobakery_workflows_dbs/v31/choco_v201901_v31   \
   --taxonomic-profile metaphlan/1064K_IGO_11592_240_metaphlan3_profile.txt \
   --threads 4

Output files will be written to: /data/brinkvd/watersn/apps/humann
Decompressing gzipped file ...



ERROR: The MetaPhlAn taxonomic profile provided was not generated with the expected database version. Please update your version of MetaPhlAn to at least v3.0.

What is the source of the issue? It looks like this is the method the might raise the error, but because the actual exception is not raised don’t know what needs to change.

I get the same problem. I’ve tried indexing the metaphlan database by hand, but there is something in the script which identifies inconsistencies even when I use the older database V29 or V31 (the script downloads and installs metaphlan4 from what I’ve seen). I haven’t been able to install metaphlan3 to test this as I can’t find this version to download. Does anyone know of an archive of metaphlan3?

Conda and pip installs both have this problem. I’m pretty sure metaphlan isn’t updating when running:
humann -i demo.fastq -o sample_results

humann_databases --download chocophlan full
seems to download the 202103 version of chocophlan.
Using --index in metaphlan, it searches in the metaphlan4 directory which doesn’t contain the older 2019 versions. But downloading the older versions and manually indexing doesn’t seem to work either…

(EDIT: My explanation here is not correct; please see my subsequent reply below for the correct explanation.)

Hi All - It looks like MetaPhlAn 4.0.2 adds a new row to the header indicating the number of reads that were mapped:

#45176646 reads processed

This is throwing off the test in HUMAnN that checks for compatibility between your MetaPhlAn profile and pangenome database. We can patch this in HUMAnN. As a temp solution you could roll back your MetaPhlAn version OR remove the “reads processed” line from your profile and that ought to restore the expected behavior.

Yeah, it weird. My metaphlan file should pass the checks the database version checking on line 157 here, but I don’t know for sure because for some reason it looks like it reads through and attempts to parse the entire file before actually raising a error about db version and exiting on line 213 here. So even if it detects an invalid version its still going to try to parse every line in the file before it would quit with the proper error The MetaPhlAn taxonomic profile provided was not generated with the database version <x> or <y>.

Also looks like there is a syntax error on line 215, where

                config.metaphlan_v3_db_version+" or "+metaphlan_v4_db_version+" . Please update your version of MetaPhlAn to at least v3.0."

should probably be

                config.metaphlan_v3_db_version+" or "+config.metaphlan_v4_db_version+" . Please update your version of MetaPhlAn to at least v3.0."

Otherwise you get the


Decompressing gzipped file ...

Traceback (most recent call last):
  File "/home/watersn/miniconda3/bin/humann", line 33, in <module>
    sys.exit(load_entry_point('humann', 'console_scripts', 'humann')())
  File "/lila/home/watersn/GitHub/humann/humann/humann.py", line 979, in main
    custom_database = prescreen.create_custom_database(config.nucleotide_database, bug_file)
  File "/lila/home/watersn/GitHub/humann/humann/search/prescreen.py", line 215, in create_custom_database
    config.metaphlan_v3_db_version+" or "+metaphlan_v4_db_version+" . Please update your version of MetaPhlAn to at least v3.0."
NameError: name 'metaphlan_v4_db_version' is not defined

Thanks for the reply @franzosa . I am happy to help write a validator if that would be helpful. Then we could have some tests in the repo to ensure that the code handles all the versions of metaphlan versions should be supportable.

My explanation above is not correct (I will flag it after posting this). @lauren.j.mciver pointed out that your MetaPhlAn profile has extra output columns from running in a non-default mode. This is the source of the error (not the added read count). We discussed this recently here as well:

If the alternate MetaPhlAn output is becoming more popular with users we can add some robustness to the extra columns in a future HUMAnN release. Sorry for any confusion caused by my earlier reply!

Thanks @franzosa, that’s only part of the issue. I selected the first three of the columns from the same file and ran again, but there is an issue with line endings breaking the isdigit().

if you change:

data=line.split("\t")
if data[-1].replace(".","").replace("e-","").isdigit():

to

data=line.strip().split("\t")
if data[-1].replace(".","").replace("e-","").isdigit():

it works fine. Not sure why it isn’t an issue with the original files.

For instance, changing the get_abundance() function to:

def get_abundance(line):                                                                                                                                                                                                                                                                                                                 
    """                                                                                                                                                                                                                                                                                                                                  
    Read in the abundance value from the taxonomy file                                                                                                                                                                                                                                                                                   
    """                                                                                                                                                                                                                                                                                                                                  
    try:                                                                                                                                                                                                                                                                                                                                 
        data=line.split("\t")                                                                                                                                                                                                                                                                                                            
        if data[-1].replace(".","").replace("e-","").isdigit():                                                                                                                                                                                                                                                                          
            read_percent=float(data[-1])                                                                                                                                                                                                                                                                                                 
        else:                                                                                                                                                                                                                                                                                                                            
            read_percent=float(data[-2])                                                                                                                                                                                                                                                                                                 
    except Exception as e:                                                                                                                                                                                                                                                                                                               
        print(repr(data[-1].replace(".","").replace("e-","")) )                                                                                                                                                                                                                                                                                
        print(e)                                                                                                                                                                                                                                                                                                                         
        print(repr(line))                                                                                                                                                                                                                                                                                                                
        message="The MetaPhlAn taxonomic profile provided was not generated with the expected database version. Please update your version of MetaPhlAn to at least v3.0."                                                                                                                                                               
        logger.error(message)                                                                                                                                                                                                                                                                                                            
        sys.exit("\n\nERROR: "+message)                                                                                                                                                                                                                                                                                                  

I get

Output files will be written to: /lila/home/watersn/GitHub/humann/humann
Decompressing gzipped file ...

'3439319\n'

could not convert string to float: '2|1239|186801|186802|541000|946234|292800'
'k__Bacteria|p__Firmicutes|c__Clostridia|o__Clostridiales|f__Ruminococcaceae|g__Flavonifractor|s__Flavonifractor_plautii\t2|1239|186801|186802|541000|946234|292800\t34.39319\n'


ERROR: The MetaPhlAn taxonomic profile provided was not generated with the expected database version. Please update your version of MetaPhlAn to at least v3.0.

but adding the change above removes the error.

NOTE: this means the existing code for checking against the config.prescreen_threshold is broken
if you run the current version (f147b28ea9 ) on the test data in ./humann/tests/data/demo_metaphlan_bugs_list.tsv, the read_percent for all of those entries is reported as 1:
data[-1].replace(".","").replace("e-","").isdigit() is never true, so it falls back to float(data[-2]) , which is the taxid.

Hi @nickp60 , If you run without the MetaPhlAn option -t rel_ab_w_read_stats you should be set! By default MetaPhlAn writes the relative abundances to the second to the last column.

Thank you,
Lauren

If that’s the case than the test data and the documentation should reflect that — currently both show the relative abundances to be in the last column.

For my purposes I can do cut -f 1-4 [myfile] > myfile_cut.txt and run it on the trimmed file.

Either way, the bug I showed above should be probably be rectified. Under normal circumstances I would submit a PR with a fix; let me know if you would be receptive to that.

Hi @nickp60 , Thanks for the info! It is possible we have some examples (tests and docs) that reflect an older version of MetaPhlAn. I will look into those that you noted and make sure to get them updated to the latest version of MetaPhlAn. I think the code itself is okay as I have confirmed it works with the latest MetaPhlAn version (default output).

Thanks!
Lauren

No problem!

Please feel free to check:

       if data[-1].replace(".","").replace("e-","").isdigit():                                                                                                                                                                                                                                                                          
            read_percent=float(data[-1])                                                                                                                                                                                                                                                                                                 

is never going to evaluate to True unless you strip the whitespace. If that functionality not needed due to the input requirements, it would clarify things to get rid of it. Also see my comments above: line 215 has a bug that will cause a name error instead of warning the user that the wrong DB is being used. You can verify it as follows:

# pretend we have metaphlan from an invalid db
curl https://raw.githubusercontent.com/biobakery/humann/master/humann/tests/data/demo_metaphlan_bugs_list.tsv | sed "s|#v30_CHOCOPhlAn|#v29_CHOCOPhlAn|" > demo_metaphlan_bad_bugs_list.tsv

$ humann --input $PWD/tmp.fastq.gz  --output humann  --taxonomic-profile demo_metaphlan_bad_bugs_list.tsv 

Output files will be written to: /lila/home/watersn/GitHub/humann/humann
Decompressing gzipped file ...

Traceback (most recent call last):
  File "/home/watersn/miniconda3/bin/humann", line 33, in <module>
    sys.exit(load_entry_point('humann', 'console_scripts', 'humann')())
  File "/lila/home/watersn/GitHub/humann/humann/humann.py", line 979, in main
    custom_database = prescreen.create_custom_database(config.nucleotide_database, bug_file)
  File "/lila/home/watersn/GitHub/humann/humann/search/prescreen.py", line 215, in create_custom_database
    config.metaphlan_v3_db_version+" or "+metaphlan_v4_db_version+" . Please update your version of MetaPhlAn to at least v3.0."
NameError: name 'metaphlan_v4_db_version' is not defined

See the note above as well that that only gets executed after the file has been (successfully) parsed – the user has no way of knowing whether the error is due to an invalid line in the bugs list or due to an incompatible database used to generate the bugs list.

We use biobakery tools a lot in our lab, and we appreciate all the work that has gone into them. I am more than happy to help write tests, refactor etc to keep these sorts of things from popping up.

Thanks!

Nick