PhyloPhlAn: using a manually downloaded database

Dear PhyloPhlAn team and community,

Running this command in PhyloPhlAn (version 3.0.60 (27 November 2020), installed through conda):
(myenv) usr@srvr:~$ phylophlan -i /home/usr/data/input/11assemblies -d /home/usr/phylophlan_databases phylophlan_databases --diversity medium -f supermatrix_nt.cfg --nproc 8

…resulted in the following:
Traceback (most recent call last):
  File "/home/usr/miniconda3/envs/myenv/bin/phylophlan", line 10, in <module>
    sys.exit(phylophlan_main())
  File "/home/usr/miniconda3/envs/myenv/lib/python3.7/site-packages/phylophlan/phylophlan.py", line 3227, in phylophlan_main
    verbose=args.verbose)
  File "/home/usr/miniconda3/envs/myenv/lib/python3.7/site-packages/phylophlan/phylophlan.py", line 818, in init_database
    for f in glob.iglob(os.path.join(folder, '*'))
  File "/home/usr/miniconda3/envs/myenv/lib/python3.7/site-packages/phylophlan/phylophlan.py", line 819, in <listcomp>
    for _, seq in SimpleFastaParser(bz2.open(f, 'rt') if f.endswith('.bz2') else open(f))])
  File "/home/usr/miniconda3/envs/myenv/lib/python3.7/site-packages/Bio/SeqIO/FastaIO.py", line 47, in SimpleFastaParser
    for line in handle:
  File "/home/usr/miniconda3/envs/myenv/lib/python3.7/codecs.py", line 322, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xba in position 1035: invalid start byte
(myenv) usr@srvr:~$

I would greatly appreciate any feedback on how to fix this issue, so I could run PhyloPhlAn to construct a phylogeny of 11 whole-genome assemblies from pure cultured strains (no metagenomic data).

These assemblies are in this folder:
/home/usr/data/input/11assemblies

The database is in this folder:
/home/usr/phylophlan_databases
containing the following 2 files (i) and (ii), both manually downloaded via http://cmprod1.cibio.unitn.it/databases/PhyloPhlAn/phylophlan_databases.txt:
(i) phylophlan.tar
(downloaded from https://zenodo.org/record/4005620/files/phylophlan.tar?download=1)
(ii) phylophlan.md5
(downloaded from https://zenodo.org/record/4005620/files/phylophlan.md5?download=1)
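
Before pointing PhyloPhlAn at a manually downloaded archive, it can help to confirm that the download is intact. The following is a minimal sketch, assuming that phylophlan.md5 is a plain-text file that contains the 32-character hexadecimal MD5 digest of phylophlan.tar somewhere in its contents (the exact layout of that file is not stated here). The function names are illustrative and are not part of PhyloPhlAn.

```python
import hashlib
import re


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hexadecimal MD5 digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_checksum_file(archive_path, md5_path):
    """Return True if the archive's MD5 digest appears in the checksum file.

    Assumption: the checksum file is text containing the digest as a
    32-character hexadecimal string, possibly followed by a file name.
    """
    with open(md5_path, "r", encoding="utf-8", errors="replace") as handle:
        recorded = re.findall(r"\b[0-9a-fA-F]{32}\b", handle.read())
    return md5_of_file(archive_path) in {value.lower() for value in recorded}
```

With the paths from this post, the check would be matches_checksum_file("/home/usr/phylophlan_databases/phylophlan.tar", "/home/usr/phylophlan_databases/phylophlan.md5").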

Please note that I had to download the database manually (following suggestions found in GitHub issue #18 of biobakery/phylophlan, “Using manually downloaded database”) because of limitations with the internet connection/firewall on my system.

Also, I have already come across GitHub issue #9 of biobakery/phylophlan (“local variable ‘input_faa_clean’ referenced before assignment”), which suggests that getting PhyloPhlAn directly from the repository would fix an issue that I guess is similar to mine (if not identical). Unfortunately, I can’t install PhyloPhlAn directly from the repository because of limitations with the internet connection/firewall on my system.

I would be very happy for any suggestions on what I could do to get PhyloPhlAn running.

Thanks already in advance,
Michael

Hello Michael, sorry I didn’t receive the notification of your post.

I think the problem is that you’re specifying the folder containing the database(s) with the -d param, whereas -d should specify which database to use from that folder. The folder itself goes in --databases_folder.
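
The traceback is consistent with this: init_database loops over every file in the folder given to -d and reads each one as FASTA text, so a binary file such as phylophlan.tar fails to decode. Below is a minimal sketch that reproduces the same kind of failure with made-up temporary files rather than the real archive; the helper name and file names are hypothetical.

```python
import os
import tempfile


def readable_as_text(path):
    """Return True if every line of the file decodes as UTF-8 text."""
    try:
        with open(path, "r", encoding="utf-8") as handle:
            for _ in handle:
                pass
        return True
    except UnicodeDecodeError:
        return False


with tempfile.TemporaryDirectory() as folder:
    fasta = os.path.join(folder, "marker.faa")
    binary = os.path.join(folder, "archive.tar")
    with open(fasta, "w", encoding="utf-8") as handle:
        handle.write(">p0001\nMKTAYIAKQR\n")
    with open(binary, "wb") as handle:
        # 0xba cannot start a UTF-8 sequence, as in the reported error.
        handle.write(b"header\xba\x00\x01")

    fasta_ok = readable_as_text(fasta)
    binary_ok = readable_as_text(binary)

print(fasta_ok, binary_ok)  # True False
```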

The command:
phylophlan -i /home/usr/data/input/11assemblies -d /home/usr/phylophlan_databases phylophlan_databases --diversity medium -f supermatrix_nt.cfg --nproc 8

should be:
phylophlan -i /home/usr/data/input/11assemblies -d phylophlan --databases_folder /home/usr/phylophlan_databases phylophlan_databases --diversity medium -f supermatrix_nt.cfg --nproc 8

Please let me know if something is not clear.

Many thanks,
Francesco

Hello Francesco,

Thank you very much for your reply and analysis of the problem.

Having used your command and, based on it, slightly modified commands, I have unfortunately not yet succeeded in getting PhyloPhlAn running - but I hope to have gotten closer.

Below, I have included details about these commands, hoping that this may help in narrowing down the problem. (I have decided to split this information into separate posts so it is easier to grasp.)

Any feedback on this would be greatly appreciated.

Many thanks,
Michael

(1) With your command (as suggested under ‘should be:’ in your post)…
phylophlan
-i /home/usr/data/input/11assemblies
-d phylophlan
--databases_folder /home/usr/phylophlan_databases phylophlan_databases
--diversity medium
-f supermatrix_nt.cfg
--nproc 8

…I got the following:
usage: phylophlan [-h] [-i INPUT | -c CLEAN] [-o OUTPUT] [-d DATABASE]
                  [-t {n,a}] [-f CONFIG_FILE] --diversity {low,medium,high}
                  [--accurate | --fast] [--clean_all] [--database_list]
                  [-s SUBMAT] [--submat_list] [--submod_list] [--nproc NPROC]
                  [--min_num_proteins MIN_NUM_PROTEINS]
                  [--min_len_protein MIN_LEN_PROTEIN]
                  [--min_num_markers MIN_NUM_MARKERS]
                  [--trim {gap_trim,gap_perc,not_variant,greedy}]
                  [--gap_perc_threshold GAP_PERC_THRESHOLD]
                  [--not_variant_threshold NOT_VARIANT_THRESHOLD]
                  [--subsample {phylophlan,onethousand,sevenhundred,fivehundred,threehundred,onehundred,fifty,twentyfive,tenpercent,twentyfivepercent,fiftypercent,full}]
                  [--unknown_fraction UNKNOWN_FRACTION]
                  [--scoring_function {trident,muscle,random}] [--sort]
                  [--remove_fragmentary_entries]
                  [--fragmentary_threshold FRAGMENTARY_THRESHOLD]
                  [--min_num_entries MIN_NUM_ENTRIES] [--maas MAAS]
                  [--remove_only_gaps_entries] [--mutation_rates]
                  [--force_nucleotides] [--input_folder INPUT_FOLDER]
                  [--data_folder DATA_FOLDER]
                  [--databases_folder DATABASES_FOLDER]
                  [--submat_folder SUBMAT_FOLDER]
                  [--submod_folder SUBMOD_FOLDER]
                  [--configs_folder CONFIGS_FOLDER]
                  [--output_folder OUTPUT_FOLDER]
                  [--genome_extension GENOME_EXTENSION]
                  [--proteome_extension PROTEOME_EXTENSION] [--update]
                  [--citation] [--verbose] [-v]
phylophlan: error: unrecognized arguments: phylophlan_databases

(2) Thus, I removed “phylophlan_databases” from this command, resulting in the following command…
phylophlan
-i /home/usr/data/input/11assemblies
-d phylophlan
--databases_folder /home/usr/phylophlan_databases
--diversity medium
-f supermatrix_nt.cfg
--nproc 8

…for which I got the following:
[e] -t/--db_type not specified and could not automatically detect the input database file(s)

(3) Both “-t” and “--db_type” are listed as optional arguments in the PhyloPhlAn user manual (version Jun 3, 2020, section phylophlan.py), so I added them to the previous command, resulting in the following command…
phylophlan
-i /home/usr/data/input/11assemblies
-d phylophlan
-t n
--db_type n
--databases_folder /home/usr/phylophlan_databases
--diversity medium
-f supermatrix_nt.cfg
--nproc 8

…for which I got the following:
[e] database format ("/home/usr/phylophlan_databases/phylophlan/phylophlan.fna", "/home/usr/phylophlan_databases/phylophlan/phylophlan.fna.bz2", or "/home/usr/phylophlan_databases/phylophlan") not recognize

Please note that the folder
/home/usr/phylophlan_databases
contains the following 2 files:
phylophlan.tar
and
phylophlan.md5

To be more precise, these file paths are:
/home/usr/phylophlan_databases/phylophlan.tar
and
/home/usr/phylophlan_databases/phylophlan.md5
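
The error message in (3) names the locations that PhyloPhlAn looked at. Here is a minimal sketch for checking which of those paths exist, assuming the layout implied by the message: a subfolder named after the database, holding a file with the .fna extension for -t n. The .faa extension for -t a is an assumption by analogy, and the function names are hypothetical rather than part of PhyloPhlAn.

```python
import os


def candidate_database_paths(databases_folder, database, db_type):
    """List the three paths named in the 'database format' error message."""
    extension = {"n": ".fna", "a": ".faa"}[db_type]
    base = os.path.join(databases_folder, database)
    return [
        os.path.join(base, database + extension),
        os.path.join(base, database + extension + ".bz2"),
        base,
    ]


def report(databases_folder, database, db_type):
    """Return (path, exists) pairs for each candidate location."""
    paths = candidate_database_paths(databases_folder, database, db_type)
    return [(path, os.path.exists(path)) for path in paths]
```

Calling report("/home/usr/phylophlan_databases", "phylophlan", "n") on a folder that holds only phylophlan.tar and phylophlan.md5 would be expected to mark all three paths as missing, which matches the error.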

P.S.: Please let me know if any further information would be required to help solve this problem and get PhyloPhlAn running in my conda environment.

Thank you very much again,
Michael

Hello Michael,

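Thanks for your messages. Sorry for the wrong command, it was definitely a copy-paste mistake. The command you corrected:

phylophlan -i /home/usr/data/input/11assemblies -d phylophlan --databases_folder /home/usr/phylophlan_databases --diversity medium -f supermatrix_nt.cfg --nproc 8

looks good.

The error:

[e] -t/--db_type not specified and could not automatically detect the input database file(s)

is a bit strange, as PhyloPhlAn should be able to infer the database type automatically from the database file(s).
Your fix, -t n, doesn’t work because the phylophlan database is a set of proteins, so you should specify -t a. There is no need to also pass --db_type, since -t is its short version.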

Will you be able to run this command (which should be correct now):

phylophlan -i /home/usr/data/input/11assemblies -d phylophlan -t a --databases_folder /home/usr/phylophlan_databases --diversity medium -f supermatrix_aa.cfg --nproc 8 --verbose 2>&1 | tee phylophlan_11assemblies.log

That should generate a log file, phylophlan_11assemblies.log. If you still run into errors, it would be great if you could share it with me.

Many thanks,
Francesco

Hello Francesco,

Thanks a lot for your reply and efforts in further analyzing this problem.

Happy to hear that my command (quoted by you) looks good.

Like you, I was surprised that PhyloPhlAn did not automatically detect the database file(s). I had assumed it would be sufficient to download the database files (phylophlan.tar and phylophlan.md5), put them in a folder, and indicate the location of this folder using the “--databases_folder” option.

Also thanks for having explained and corrected my rookie mistake with the selection of “-t n” (wrong) instead of “-t a”.

Having run the command which you have kindly included in your reply, I still get an error:
[e] database format ("/home/usr/phylophlan_databases/phylophlan/phylophlan.faa", "/home/usr/phylophlan_databases/phylophlan/phylophlan.faa.bz2", or "/home/usr/phylophlan_databases/phylophlan") not recognize

Also, I have attached phylophlan_11assemblies.log, taking you up on your kind offer to look at it. Biobakery wouldn’t let me upload a *.log file (“Sorry, the file you are trying to upload is not authorized (authorized extensions: jpg, jpeg, png, gif, txt, rtf, csv, tsv, biom).”), so I have changed the file extension to *.txt.

Many thanks again,
Michael

Attachment: phylophlan_11assemblies.txt (2.5 KB)

P.S.:
This is the command and what I got from the Terminal:

(base) usr@srvr:~$ conda activate myenv
(myenv) usr@srvr:~$ phylophlan -i /home/usr/data/input/11assemblies -d phylophlan -t a --databases_folder /home/usr/phylophlan_databases --diversity medium -f supermatrix_aa.cfg --nproc 8 --verbose 2>&1 | tee phylophlan_11assemblies.log
PhyloPhlAn version 3.0.60 (27 November 2020)

Command line: /home/usr/miniconda3/envs/myenv/bin/phylophlan -i /home/usr/data/input/11assemblies -d phylophlan -t a --databases_folder /home/usr/phylophlan_databases --diversity medium -f supermatrix_aa.cfg --nproc 8 --verbose

Automatically setting "input=11assemblies" and "input_folder=/home/usr/data/input"
"medium-accurate" preset
Setting "sort=True" because "database=phylophlan"
Setting "min_num_markers=100" since no value has been specified and the "database=phylophlan"
Arguments: {'input': '11assemblies', 'clean': None, 'output': '11assemblies_phylophlan', 'database': 'phylophlan', 'db_type': 'a', 'config_file': 'supermatrix_aa.cfg', 'diversity': 'medium', 'accurate': True, 'fast': False, 'clean_all': False, 'database_list': False, 'submat': 'pfasum60', 'submat_list': False, 'submod_list': False, 'nproc': 8, 'min_num_proteins': 1, 'min_len_protein': 50, 'min_num_markers': 100, 'trim': 'gap_trim', 'gap_perc_threshold': 0.67, 'not_variant_threshold': 0.99, 'subsample': <function onehundred at 0x7f2f8cd8a3b0>, 'unknown_fraction': 0.3, 'scoring_function': <function trident at 0x7f2f8cd8a710>, 'sort': True, 'remove_fragmentary_entries': False, 'fragmentary_threshold': 0.85, 'min_num_entries': 4, 'maas': None, 'remove_only_gaps_entries': False, 'mutation_rates': False, 'force_nucleotides': False, 'input_folder': '/home/usr/data/input/11assemblies', 'data_folder': '11assemblies_phylophlan/tmp', 'databases_folder': '/home/usr/phylophlan_databases', 'submat_folder': '/home/usr/miniconda3/envs/myenv/lib/python3.7/site-packages/phylophlan/phylophlan_substitution_matrices/', 'submod_folder': '/home/usr/miniconda3/envs/myenv/lib/python3.7/site-packages/phylophlan/phylophlan_substitution_models/', 'configs_folder': '/home/usr/miniconda3/envs/myenv/lib/python3.7/site-packages/phylophlan/phylophlan_configs/', 'output_folder': '', 'genome_extension': '.fna', 'proteome_extension': '.faa', 'update': False, 'verbose': True}
Loading configuration file "supermatrix_aa.cfg"
Checking configuration file
Checking "/home/usr/miniconda3/envs/myenv/bin/diamond"
Checking "/home/usr/miniconda3/envs/myenv/bin/mafft"
Checking "/home/usr/miniconda3/envs/myenv/bin/trimal"
Checking "/home/usr/miniconda3/envs/myenv/bin/FastTreeMP"
Checking "/home/usr/miniconda3/envs/myenv/bin/raxmlHPC-PTHREADS-SSE3"
[e] database format ("/home/usr/phylophlan_databases/phylophlan/phylophlan.faa", "/home/usr/phylophlan_databases/phylophlan/phylophlan.faa.bz2", or "/home/usr/phylophlan_databases/phylophlan") not recognize
(myenv) usr@srvr:~$

P.P.S.:
Just for your information: I have replaced the real names of the server, the user and the input folder with “srvr”, “usr” and “11assemblies”, respectively. However, all of the real names consist only of letters (lowercase and uppercase), numbers and underscores, with no whitespace or periods, and they are not overly long or complicated either.

P.P.P.S.:
(a)
The configuration files were created by running the phylophlan_write_default_configs.sh and are located in /home/usr/
e.g.
/home/usr/supermatrix_nt.cfg

For my phylogeny, I have chosen the Supermatrix pipeline rather than the Supertree pipeline (as recommended in the PhyloPhlAn user manual, version Jun 3, 2020, section supermatrix-or-supertree-approach).
Also, I have chosen ‘supermatrix_nt.cfg’ rather than ‘supermatrix_aa.cfg’ because my input consists of whole-genome assemblies (nucleotide sequences) rather than translated protein sequences.

(b)
The assemblies in the /home/usr/data/input/11assemblies folder are mostly publicly available whole-genome assemblies downloaded from GenBank, for example:
A_prevotii_GCF_000024105.1_ASM2410v1_genomic.fna

(downloaded from
NCBI Assembly
as
GCF_000024105.1_ASM2410v1_genomic.fna
and renamed to
A_prevotii_GCF_000024105.1_ASM2410v1_genomic.fna)

Hello Michael!

I think I found the problem, thank you for your patience and for reporting this.

With the latest commit in the GitHub repository, this should now be fixed: GitHub - biobakery/phylophlan: Precise phylogenetic analysis of microbial isolates and genomes from metagenomes
From the repo, you can download the phylophlan.py Python script and run it (with your phylophlan conda env activated) as ./phylophlan.py, using the same params as above.

Please, let me know if this fix is working for you.

I just wanted to add a small note about the config file:

The nt or aa in the config filename actually refers to the database. So, if you’re going to use the phylophlan database, you’ll need the supermatrix_aa.cfg config file, which contains the instructions for indexing an amino acid database and also the commands for the translated search, since your inputs are genomes. By default, this config will then build a phylogeny based on the amino acid MSA. If you instead want the phylogeny to be built on nucleotides, specify the --force_nucleotides param both when creating the config file and when running PhyloPhlAn. Note that this only works if your inputs are all genomes; it won’t work if they are a mix of genomes and proteomes.

Many thanks,
Francesco

Hello Francesco,

Thank you very much again for your quick reply and efforts!

I’ll get back to you as soon as I’ve checked if the fix was working.

Many thanks,
Michael

Hello Francesco,

Having run the phylophlan.py script, there is still a (little) issue.

Running the command shown at the second prompt below, I got the following:
(base) usr@srvr:~$ conda activate myenv
(myenv) usr@srvr:~$ phylophlan -i /home/usr/data/input/11assemblies -d phylophlan --databases_folder /home/usr/phylophlan_databases --diversity medium -f supermatrix_nt.cfg --nproc 8
[e] both db_dna and db_aa are None!
(myenv) usr@srvr:~$

Also, I have noticed that the folder in which I have put the manually downloaded database including the checksum file
/home/usr/phylophlan_databases/phylophlan.tar
/home/usr/phylophlan_databases/phylophlan.md5

now contains another folder
/home/usr/phylophlan_databases/phylophlan

with another zipped file
/home/usr/phylophlan_databases/phylophlan/phylophlan.faa.bz2

Would be great if you could let me know if you see where the problem is.

Many thanks,
Michael

Hi Michael,

Thanks, that’s great. That folder is the actual database folder, and it should contain the compressed file (phylophlan.faa.bz2), the same file uncompressed, and its indexed version, built with the mapping software specified in the configuration file.
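
On a machine where PhyloPhlAn cannot do this itself, the first two of those three files can be prepared by hand. This is a minimal sketch using only the Python standard library, assuming that the archive unpacks to phylophlan/phylophlan.faa.bz2 as in the listing above. The function name is hypothetical, and building the index is left to PhyloPhlAn and the mapping tool named in the config file.

```python
import bz2
import os
import shutil
import tarfile


def unpack_database(tar_path, databases_folder, database="phylophlan"):
    """Extract the archive and decompress the .faa.bz2 file next to it.

    Returns the path of the uncompressed FASTA file. Only use this on an
    archive from a trusted source, since every member is extracted.
    """
    with tarfile.open(tar_path) as archive:
        archive.extractall(databases_folder)

    compressed = os.path.join(databases_folder, database, database + ".faa.bz2")
    uncompressed = compressed[: -len(".bz2")]
    # Keep the .bz2 file in place; only add the uncompressed copy.
    with bz2.open(compressed, "rb") as src, open(uncompressed, "wb") as dst:
        shutil.copyfileobj(src, dst)
    return uncompressed
```

For the setup in this thread, the call would be unpack_database("/home/usr/phylophlan_databases/phylophlan.tar", "/home/usr/phylophlan_databases").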

The error you got now:

[e] both db_dna and db_aa are None!

is likely because the config file does not contain the proper sections for generating the indexed version of the database. From the filename, it seems you’re using the default config file for a nucleotide/genes database (supermatrix_nt.cfg), which would explain the error. Since the phylophlan database is a set of 400 universally conserved proteins, you should use the default config file generated for a protein database: supermatrix_aa.cfg.
If you don’t have such a config file it should be pretty easy to generate one with the phylophlan_write_config_file utility.
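
A minimal sketch of such a call, written as a small Python helper that only assembles and prints the command line without running anything. The option names are those used in the phylophlan_write_config_file call quoted elsewhere in this thread. The tool choices (diamond, mafft, trimal, fasttree, raxml) are assumptions based on the programs that supermatrix_aa.cfg checked in the log above, and the helper name is hypothetical.

```python
import shlex


def write_config_command(output="supermatrix_aa.cfg", force_nucleotides=False):
    """Assemble a phylophlan_write_config_file call for a protein database."""
    args = [
        "phylophlan_write_config_file",
        "-o", output,
        "-d", "a",               # the database holds amino acid sequences
        "--db_aa", "diamond",    # tool used to index the protein database
        "--map_dna", "diamond",  # translated search for genome inputs
        "--map_aa", "diamond",   # search for proteome inputs
        "--msa", "mafft",
        "--trim", "trimal",
        "--tree1", "fasttree",
        "--tree2", "raxml",
    ]
    if force_nucleotides:
        # Only for all-genome inputs; also pass it when running PhyloPhlAn.
        args.append("--force_nucleotides")
    return " ".join(shlex.quote(arg) for arg in args)


print(write_config_command())
```

The printed line can be pasted into the shell with the conda env activated. Passing force_nucleotides=True appends --force_nucleotides for a nucleotide-based phylogeny.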

Please let me know if something is not clear.

Many thanks,
Francesco

Hi Francesco,

Thank you very much again for your swift and patient reply.

Sorry for having overlooked the explanation of nt vs. aa databases that you had already included in your previous answer. It makes total sense to use the protein database, as I want PhyloPhlAn to calculate a protein-based (not nucleotide-based) phylogenomic tree.

Most importantly, with this little correction the command worked and PhyloPhlAn is running. :)

Grazie mille and many thanks again,
Michael

I seem to be having a similar problem, but the solutions in the thread haven’t worked for me. I’m working on a cluster with no internet access, so I downloaded the phylophlan database manually and copied the files into /home/phylophlan. I generated the config file as follows:

phylophlan_write_config_file \
    -o custom_config_aa.cfg \
    -d a \
    --db_aa diamond \
    --map_aa diamond \
    --map_dna diamond \
    --msa muscle \
    --trim trimal \
    --tree1 fasttree \
    --tree2 raxml \
    --force_nucleotides

and I then executed the following command:

phylophlan \
    -i /PATH/TO/MAGS/ \
    -o /PATH/TO/MAGS/phylophlan_out \
    --genome_extension ".fasta" \
    -d phylophlan \
    --databases_folder /home/phylophlan \
    -t a \
    --diversity medium \
    -f custom_config_aa.cfg \
    --nproc 20 \
    --force_nucleotides \
    --verbose

However, I get the following output and error:

PhyloPhlAn version 3.0.67 (24 August 2022)

Command line: /home/bin/phylophlan -i /PATH/TO/MAGS/ -o /PATH/TO/MAGS/phylophlan_out --genome_extension .fasta -d phylophlan --databases_folder /home/phylophlan -t a --diversity medium -f custom_config_aa.cfg --nproc 20 --force_nucleotides --verbose

Automatically setting "input=refined" and "input_folder=/PATH/TO/MAGS/"
"medium-accurate" preset
Setting "sort=True" because "database=phylophlan"
Setting "min_num_markers=100" since no value has been specified and the "database=phylophlan"

Arguments: {'input': 'MAGS', 'clean': None, 'output': '/PATH/TO/MAGS/phylophlan_out', 'database': 'phylophlan', 'db_type': 'a', 'config_file': 'phylophlan_configs/custom_config_aa.cfg', 'diversity': 'medium', 'accurate': True, 'fast': False, 'clean_all': False, 'database_list': False, 'submat': 'pfasum60', 'submat_list': False, 'submod_list': False, 'nproc': 20, 'min_num_proteins': 1, 'min_len_protein': 50, 'min_num_markers': 100, 'trim': 'gap_trim', 'gap_perc_threshold': 0.67, 'not_variant_threshold': 0.99, 'subsample': <function onehundred at 0x2aae845068c0>, 'unknown_fraction': 0.3, 'scoring_function': <function trident at 0x2aae84506c20>, 'sort': True, 'remove_fragmentary_entries': False, 'fragmentary_threshold': 0.85, 'min_num_entries': 4, 'maas': None, 'remove_only_gaps_entries': False, 'mutation_rates': False, 'force_nucleotides': True, 'convert_N2gap': False, 'input_folder': '/PATH/TO/MAGS/', 'data_folder': '/PATH/TO/MAGS/phylophlan_out/tmp', 'databases_folder': '/home/phylophlan', 'submat_folder': '/home/.local/lib/python3.10/site-packages/phylophlan/phylophlan_substitution_matrices/', 'submod_folder': '/home/.local/lib/python3.10/site-packages/phylophlan/phylophlan_substitution_models/', 'configs_folder': 'phylophlan_configs/', 'output_folder': '', 'genome_extension': '.fasta', 'proteome_extension': '.faa', 'update': False, 'verbose': True}

Loading configuration file "phylophlan_configs/custom_config_aa.cfg"
Checking configuration file
Checking "/home/bin/diamond"
Checking "/home/bin/muscle"
Checking "/home/bin/trimal"
Checking "/home/bin/FastTreeMP"
Checking "/home/bin/raxmlHPC-PTHREADS-SSE3"
[e] database format ("/home/phylophlan/phylophlan/phylophlan.faa", "/home/phylophlan/phylophlan/phylophlan.faa.bz2", or "/home/phylophlan/phylophlan") not recognize

Hi @Keaton_Stagaman1, can you post the content of the database folder /home/phylophlan?
Also, when you said that you “downloaded the phylophlan database manually”, can you be more precise about the process? If you did the download manually, I would suggest instead running PhyloPhlAn from a machine with internet access so that it does the download, decompression, etc. itself, then killing the process once the database setup is done and re-submitting the jobs to the cluster.

Thanks,
Francesco