sgb_to_gtdb_profile.py utility does not work out of the box in MetaPhlAn 4.0.5

Dear bioBakery team,

I just tried to convert my metaphlan4 profile to a GTDB one, and the utility sgb_to_gtdb_profile.py fails.

I launch it this way:

sgb_to_gtdb_profile.py -i SAMEA3178944/SAMEA3178944.metaphlan4_profile.txt -o SAMEA3178944/SAMEA3178944.metaphlan4_profile.txt.gtdb

And it fails with

Wed Mar  1 18:49:31 2023: Start execution
Wed Mar  1 18:49:31 2023: [Error] The default MetaPhlAn database cannot be found at: /opt/conda/lib/python3.9/site-packages/metaphlan/utils/../metaphlan_databases/mpa_latest
Wed Mar  1 18:49:31 2023: Stop StrainPhlAn execution.

When I look at the code that raises this error (metaphlan.utils.database_controller), I see:

    def resolve_database(self, database):
        """Resolves the path to the MPA database

        Args:
            database (str): the name or path of the database

        Returns:
            str: the resolved path to the database
        """
        if database == 'latest':
            if os.path.exists(os.path.join(self.default_db_folder, 'mpa_latest')):
                with open(os.path.join(self.default_db_folder, 'mpa_latest'), 'r') as mpa_latest:
                    return '{}/{}.pkl'.format(self.default_db_folder, [line.strip() for line in mpa_latest
                                                                       if not line.startswith('#')][0])
            else:
                error('The default MetaPhlAn database cannot be found at: {}'.format(
                    os.path.join(self.default_db_folder, 'mpa_latest')), exit=True)
        else:
            return database

So there are two issues here. First, there is no option to specify the database path (as there is in the main metaphlan command). Second, I could not find in the documentation what that .pkl file is, where it lives, or how to create it: it was not installed by the metaphlan --install command, nor does it appear in the Index of /biobakery4/metaphlan_databases. Also, it is a mystery to me why StrainPhlAn appears in the output; I did not intend to use it.
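
To check by hand what 'latest' would resolve to in a given folder, the logic quoted above can be reproduced outside MetaPhlAn. This is only a minimal sketch: resolve_latest is a hypothetical helper, not part of the package, and it assumes (as in the quoted snippet) that mpa_latest is a text file whose first non-comment line is the database name.

```python
import os


def resolve_latest(db_folder):
    """Return the .pkl path that the 'latest' database name would point to.

    Hypothetical helper mirroring the quoted resolve_database logic;
    returns None when the mpa_latest marker file is missing or empty.
    """
    marker = os.path.join(db_folder, "mpa_latest")
    if not os.path.exists(marker):
        return None
    with open(marker, "r") as handle:
        # Keep non-empty lines that are not comments, as the original does.
        names = [line.strip() for line in handle
                 if line.strip() and not line.startswith("#")]
    if not names:
        return None
    return os.path.join(db_folder, names[0] + ".pkl")
```

Calling resolve_latest on the folder reported in the error message shows whether the marker file is there and, if so, which .pkl file the script expects to find next to it.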

Raynald

Hi @delahondes
As stated in the announcement of the new version and in the changelog, the vOct22 GTDB taxonomy assignment was not ready yet. However, we just pushed the new version 4.0.6, which makes the script compatible with the latest version of the database.
The MetaPhlAn database .pkl file should be included in your default metaphlan_databases folder. You can find your default database folder by running metaphlan --help and looking at the default value of the bowtie2db parameter, e.g.:

--bowtie2db METAPHLAN_BOWTIE2_DB
Folder containing the MetaPhlAn database. You can specify the location by exporting the DEFAULT_DB_FOLDER variable in the shell.[default /…/anaconda3/envs/metaphlan-4/lib/python3.9/site-packages/MetaPhlAn44-4.0.4-py3.9.egg/metaphlan44/metaphlan_databases]
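
Once the folder is known, a quick way to see whether it actually holds a database .pkl file is to list them. The sketch below is an illustration only: list_pkl_files is a hypothetical helper, not a MetaPhlAn function.

```python
import os


def list_pkl_files(db_folder):
    """Return the sorted names of the .pkl files found in db_folder."""
    return sorted(name for name in os.listdir(db_folder)
                  if name.endswith(".pkl"))
```

For example, list_pkl_files(os.environ["DEFAULT_DB_FOLDER"]) would inspect the folder exported through the variable named in the help text, assuming that variable is set in your shell.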

Hi @aitor.blancomiguez, I think I have found the .pkl file: it is in the http://cmprod1.cibio.unitn.it/biobakery4/metaphlan_databases/mpa_vOct22_CHOCOPhlAnSGB_202212.tar tar file. However, it is not downloaded by metaphlan --install, at least in 4.0.5; I did not re-test with 4.0.6.

Hi @aitor.blancomiguez, first I must apologize because I was wrong (I was too quick and did not check thoroughly, sorry about that): metaphlan --install does install the .pkl file, including in version 4.0.5.

I have seen your new option, which seems to get things working, but apparently I am still missing another file, bow_SGB2GTDB.tsv:

$ docker run --rm -it -v $(pwd)/resource:/resource -v $(pwd)/input:/input gmtscience/metaphlan4:4.0.6 bash
(base) root@8823dcbfa8a6:/tmp# ls /resource/metaphlan/bowtie2
mpa_latest                              mpa_vOct22_CHOCOPhlAnSGB_202212.pkl
mpa_vOct22_CHOCOPhlAnSGB_202212.1.bt2l  mpa_vOct22_CHOCOPhlAnSGB_202212.rev.1.bt2l
mpa_vOct22_CHOCOPhlAnSGB_202212.2.bt2l  mpa_vOct22_CHOCOPhlAnSGB_202212.rev.2.bt2l
mpa_vOct22_CHOCOPhlAnSGB_202212.3.bt2l  mpa_vOct22_CHOCOPhlAnSGB_202212_VINFO.csv
mpa_vOct22_CHOCOPhlAnSGB_202212.4.bt2l  mpa_vOct22_CHOCOPhlAnSGB_202212_VSG.fna
(base) root@8823dcbfa8a6:/tmp# sgb_to_gtdb_profile.py -d /resource/metaphlan/bowtie2 -i /input/SAMEA3178944.metaphlan4_profile.txt -o /input/SAMEA3178944.gtdb
Fri Mar  3 15:48:02 2023: Start execution
Traceback (most recent call last):
  File "/opt/conda/bin/sgb_to_gtdb_profile.py", line 10, in <module>
    sys.exit(main())
  File "/opt/conda/lib/python3.9/site-packages/metaphlan/utils/sgb_to_gtdb_profile.py", line 96, in main
    get_gtdb_profile(args.input, args.output, database_controller.get_database_name())
  File "/opt/conda/lib/python3.9/site-packages/metaphlan/utils/sgb_to_gtdb_profile.py", line 55, in get_gtdb_profile
    with open(os.path.join(os.path.dirname(os.path.abspath(__file__)), "{}_SGB2GTDB.tsv".format(database)), 'r') as read_file:
FileNotFoundError: [Errno 2] No such file or directory: '/opt/conda/lib/python3.9/site-packages/metaphlan/utils/bow_SGB2GTDB.tsv'
(base) root@8823dcbfa8a6:/tmp# 

I tried to re-download the database with version 4.0.6, but it does not make any difference. A trailing slash in the -d parameter changes the error only marginally:

(base) root@8823dcbfa8a6:/tmp# sgb_to_gtdb_profile.py -d /resource/metaphlan/bowtie2b/ -i /input/SAMEA3178944.metaphlan4_profile.txt -o /input/SAMEA3178944.gtdb
Fri Mar  3 16:21:55 2023: Start execution
Traceback (most recent call last):
  File "/opt/conda/bin/sgb_to_gtdb_profile.py", line 10, in <module>
    sys.exit(main())
  File "/opt/conda/lib/python3.9/site-packages/metaphlan/utils/sgb_to_gtdb_profile.py", line 96, in main
    get_gtdb_profile(args.input, args.output, database_controller.get_database_name())
  File "/opt/conda/lib/python3.9/site-packages/metaphlan/utils/sgb_to_gtdb_profile.py", line 55, in get_gtdb_profile
    with open(os.path.join(os.path.dirname(os.path.abspath(__file__)), "{}_SGB2GTDB.tsv".format(database)), 'r') as read_file:
FileNotFoundError: [Errno 2] No such file or directory: '/opt/conda/lib/python3.9/site-packages/metaphlan/utils/_SGB2GTDB.tsv'

Hi @delahondes
In version 4.0.6 you can run it as follows:
sgb_to_gtdb_profile.py -i /input/SAMEA3178944.metaphlan4_profile.txt -o /input/SAMEA3178944.gtdb

If you want to specify the db, you can run it this way:
sgb_to_gtdb_profile.py -d /resource/metaphlan/bowtie2b/mpa_vOct22_CHOCOPhlAnSGB_202212.pkl -i /input/SAMEA3178944.metaphlan4_profile.txt -o /input/SAMEA3178944.gtdb
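
The two earlier errors are consistent with the script deriving the table name from the -d value by taking the last path component and dropping its last four characters (the length of ".pkl"). That is an inference from the tracebacks above, not something read from the MetaPhlAn source; guess_table_name is a hypothetical stand-in for that behavior:

```python
def guess_table_name(database):
    """Guess the mapping-table file name looked up for a given -d value.

    Assumption inferred from the error messages: the database name is the
    last path component with its final four characters removed.
    """
    name = database.split("/")[-1][:-4]
    return "{}_SGB2GTDB.tsv".format(name)


for value in ("/resource/metaphlan/bowtie2",
              "/resource/metaphlan/bowtie2b/",
              "/resource/metaphlan/bowtie2b/mpa_vOct22_CHOCOPhlAnSGB_202212.pkl"):
    print(guess_table_name(value))
# bow_SGB2GTDB.tsv
# _SGB2GTDB.tsv
# mpa_vOct22_CHOCOPhlAnSGB_202212_SGB2GTDB.tsv
```

Under that assumption, a folder path can never produce a valid table name, which is why -d has to point at the .pkl file itself.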

Yes! It works, many thanks @aitor.blancomiguez!

$ docker run --rm -it -v $(pwd)/input:/input -v $(pwd)/resource:/resource gmtscience/metaphlan4:4.0.6 bash                                                   
(base) root@6f416f18b438:/tmp# cd /input/
(base) root@6f416f18b438:/input# ls
SAMEA3178944.metaphlan4_profile.txt  input
(base) root@6f416f18b438:/input# sgb_to_gtdb_profile.py -d /resource/metaphlan/bowtie2b/mpa_vOct22_CHOCOPhlAnSGB_202212.pkl -i /input/SAMEA3178944.metaphlan4_profile.txt -o /input/SAMEA3178944.gtdb
Tue Mar  7 18:11:53 2023: Start execution
Tue Mar  7 18:11:53 2023: Finish execution (0.14 seconds)
(base) root@6f416f18b438:/input# 

This is not working for me. Even when I specify the .pkl path, it looks in the wrong place:

sgb_to_gtdb_profile.py -i metaphlan/elcho02.fasta.txt -d /g/data/nm31/db/metaphlan/mpa_vOct22_CHOCOPhlAnSGB_202212.pkl -o elcho02_gtdb.txt
Thu Mar 16 13:34:20 2023: Start execution
Traceback (most recent call last):
  File "/home/554/ta0341/.local/bin/sgb_to_gtdb_profile.py", line 8, in <module>
    sys.exit(main())
  File "/home/554/ta0341/.local/lib/python3.9/site-packages/metaphlan/utils/sgb_to_gtdb_profile.py", line 96, in main
    get_gtdb_profile(args.input, args.output, database_controller.get_database_name())
  File "/home/554/ta0341/.local/lib/python3.9/site-packages/metaphlan/utils/sgb_to_gtdb_profile.py", line 55, in get_gtdb_profile
    with open(os.path.join(os.path.dirname(os.path.abspath(__file__)), "{}_SGB2GTDB.tsv".format(database)), 'r') as read_file:
FileNotFoundError: [Errno 2] No such file or directory: '/home/554/ta0341/.local/lib/python3.9/site-packages/metaphlan/utils/mpa_vOct22_CHOCOPhlAnSGB_202212_SGB2GTDB.tsv'

Hi @Theo_Allnutt, which version of MetaPhlAn are you using? The SGB2GTDB script has supported the vOct22 database only since version 4.0.6.
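
If it is unclear which version a given Python environment picks up, it can be checked from Python itself. This is a rough sketch: the distribution name "metaphlan" is an assumption, the helper names are hypothetical, and the parsing only handles plain numeric versions such as 4.0.6.

```python
from importlib.metadata import PackageNotFoundError, version


def version_tuple(text):
    """Turn a dotted version string such as '4.0.6' into a tuple of ints."""
    return tuple(int(part) for part in text.split(".")[:3])


def supports_oct22_gtdb(installed):
    """True if the version is at least 4.0.6, the release named above."""
    return version_tuple(installed) >= (4, 0, 6)


try:
    # "metaphlan" as the installed distribution name is an assumption.
    installed = version("metaphlan")
    print(installed, supports_oct22_gtdb(installed))
except PackageNotFoundError:
    print("metaphlan is not installed in this environment")
```

Running it with the same interpreter that launches sgb_to_gtdb_profile.py matters here, since a copy under ~/.local can shadow a newer install elsewhere.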

I reinstalled the latest version and it is now working.
Thanks.