extract_markers.py: input database path format

From the extract_markers.py script docs, it is not clear how the --database path should be formatted. I originally tried using 4.0.3, which is the directory name holding all of the bowtie2 db files, but I received the error: Could not locate a Bowtie index corresponding to basename "4.0".

After looking at the generate_markers_fasta function, I see that one must supply the bowtie2 database basename + a file extension (e.g., mpa_vJan21_CHOCOPhlAnSGB_202103.md5). It would be helpful if the extract_markers.py script docs included that info.
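One possible reading of the error above, sketched under an assumption: if the script derives the Bowtie2 basename by stripping a file extension from the --database value (this is inferred from the error message, not confirmed from the script), then a directory name such as 4.0.3 would lose its final ".3". The helper name below is hypothetical and only illustrates the standard-library behavior.

```python
import os


def basename_without_extension(database_arg):
    """Hypothetical helper: strip the last extension, as os.path.splitext does.

    Assumption: the real script does something equivalent; this is only a guess
    based on the error message quoted in this thread.
    """
    return os.path.splitext(database_arg)[0]


# A directory name with dots is treated as "name + extension"
print(basename_without_extension("4.0.3"))  # 4.0

# A database file name yields the Bowtie2 basename
print(basename_without_extension("mpa_vJan21_CHOCOPhlAnSGB_202103.md5"))  # mpa_vJan21_CHOCOPhlAnSGB_202103
```

Under that assumption, passing the directory name would make Bowtie2 look for an index called "4.0", which matches the reported error.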

Notably, metaphlan accepts the directory containing the bowtie2 database (e.g., 4.0.3), so the command-line interface differs between metaphlan and extract_markers.py.

Also, it appears that extract_markers.py assumes that the *.pkl database file is bzip2-compressed:

    def load_database(self, verbose=True):
        """Loads the MetaPhlAn PKL database"""
        if self.database_pkl is None:
            if verbose:
                info('Loading MetaPhlAn {} database...'.format(self.get_database_name()))
            self.database_pkl = pickle.load(bz2.BZ2File(self.database))
            if verbose:
                info('Done.')

Maybe a try/except would be helpful here, to allow an uncompressed pickle file as input? Does bzip2 compression really reduce the size of the pickle file?
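A minimal sketch of the fallback suggested here, not the project's implementation. The function name load_pickle_maybe_bz2 is hypothetical. It relies on the bz2 module raising OSError when the file is not a valid bz2 stream, and then retries with a plain open().

```python
import bz2
import pickle


def load_pickle_maybe_bz2(path):
    """Load a pickle file that may or may not be bz2-compressed.

    Hypothetical helper for illustration; it is not part of MetaPhlAn.
    """
    try:
        # First attempt: treat the file as bz2-compressed
        with bz2.open(path, "rb") as handle:
            return pickle.load(handle)
    except OSError:
        # bz2 raises OSError on non-bz2 data, so fall back to a plain pickle
        with open(path, "rb") as handle:
            return pickle.load(handle)
```

A missing file still raises FileNotFoundError from the second open(), so path errors are not hidden by the fallback.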

Hi @nick-youngblut
For all strainphlan-related scripts, --database should point to the metaphlan PKL database, which is always bz2-compressed when exported by us, even if the file extension does not show it. I will update the docs to make this clearer.

Thank you for updating the docs. I’m surprised that bzip2 compression helps with pickled files, since they are already binary. Regardless, thank you for clarifying!
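On the compression question, a quick way to check is to compare sizes directly. The sketch below uses synthetic stand-in data (not real MetaPhlAn content), so the numbers it prints say nothing about the actual database. It only illustrates that a pickle is binary but not itself compressed, so repetitive content can still shrink under bz2.

```python
import bz2
import pickle

# Synthetic stand-in data: many similar, marker-like entries (an assumption,
# chosen only because repetitive names are plausible for a marker database)
toy = {
    "marker_{:06d}".format(i): {"clade": "t__SGB{}".format(i % 100), "len": 1000 + i % 50}
    for i in range(20000)
}

raw = pickle.dumps(toy, protocol=pickle.HIGHEST_PROTOCOL)
packed = bz2.compress(raw)

print("raw pickle bytes:    ", len(raw))
print("bz2-compressed bytes:", len(packed))
```

Running the same comparison on the real *.pkl file (decompress it, then compare file sizes) would give the figure that matters here.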