PhyloPhlAn - creating a database of markers

Hello. I am trying to create a set of core proteins to use as a database instead of UniRef90, because the species I am studying is not there. I’ve already used the default PhyloPhlAn database, but I want to make my own. How can I do this?

Hello and thanks for using PhyloPhlAn!

To build your own database you can try following the instructions available here: PhyloPhlAn wiki - Database setup.
Basically, you’ll still need to use the phylophlan_setup_database script, but provide your own file (or folder of gene files) instead of relying on the automatic download of UniRef90.

Please let me know if something is not clear.

Many thanks,

Hi! I meant how can I get a set of genes that are markers to put in the database.

Hi, to do that you need to use tools like Prokka and Roary: the first annotates your genomes, and the second computes the set of core proteins from the gene annotations. Then you can build a custom db for PhyloPhlAn using the core genes identified by Roary.

I hope this helps, thanks,

Hello Francesco. Yes that helps, I just read this in the paper too.

Right now I am using the default phylophlan database, but would you agree it would make a “better” tree to make a custom db of markers, if looking at a single species?

Hello Ana. Yes, the phylophlan database is a set of 400 universal proteins, so they might not be specific enough to accurately resolve closely related genomes, as in your case.

I don’t know what species you’re studying, but as an alternative to the “Prokka + Roary” pipeline, you could download the UniRef90 entries for the species in the same genus as yours, build a db for PhyloPhlAn from them, and then set the --min_num_entries param in PhyloPhlAn to use only the markers that are found in “enough” genomes (basically this acts as a coreness threshold for the markers in the db).
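
To make the “enough genomes” idea concrete, here is a minimal sketch of how a coreness percentage could be turned into a genome count to pass to --min_num_entries. The helper name and the example numbers (40 genomes, 95%) are purely illustrative and not part of PhyloPhlAn; pick the percentage that suits your dataset.

```python
def min_num_entries(n_genomes, coreness_percent):
    """Smallest number of genomes that reaches coreness_percent of n_genomes."""
    # Integer ceiling division, so no floating-point rounding is involved.
    return (coreness_percent * n_genomes + 99) // 100


# Hypothetical dataset: 40 input genomes, markers required in at least 95% of them.
print(min_num_entries(40, 95))
```

The printed value is what you would then supply on the phylophlan command line after --min_num_entries.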


I’m trying to use Roary to make a database of markers. It looks like Roary will make a multi-FASTA alignment. Can this be used as a database? Do you know how I can get just the sequences and not the alignment? Thanks in advance.

Hi Ana,
from Roary you should also have a folder with all genes identified in the pangenome. What you can do is to get only those that are “core” (and here you can decide which % threshold to use) and put them in a separate folder. At that point, you can run phylophlan_setup_database on that folder to build a database formatted for PhyloPhlAn and then you can run phylophlan specifying your custom database of core genes.
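
As a rough sketch of the selection step: assuming Roary’s gene_presence_absence.csv with its "Gene" and "No. isolates" columns (worth checking against your own output), the core clusters at a chosen threshold could be listed like this. The function name, the 99% default, and the example rows are illustrative only, not part of Roary or PhyloPhlAn.

```python
import csv
import io


def core_gene_names(csv_text, n_genomes, min_percent=99):
    """Return names of gene clusters present in at least min_percent of genomes.

    Assumes Roary-style columns named "Gene" and "No. isolates".
    """
    core = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        n_isolates = int(row["No. isolates"])
        # Integer comparison avoids floating-point rounding at the threshold.
        if n_isolates * 100 >= min_percent * n_genomes:
            core.append(row["Gene"])
    return core


# Tiny made-up table for illustration (3 clusters, 4 genomes).
example = (
    '"Gene","No. isolates"\n'
    '"gyrB","4"\n'
    '"group_12","3"\n'
    '"recA","4"\n'
)
print(core_gene_names(example, n_genomes=4))
```

With the list of core cluster names in hand, you can copy the matching sequence files into the separate folder that phylophlan_setup_database will read.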

Alternatively, you can take the core_gene_alignment.aln, remove all the gaps (-) added by the MSA and run phylophlan_setup_database on the unaligned multi-fasta file.
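
Removing the gaps can be done with a few lines of Python. This is only a minimal sketch that works on the FASTA text directly: it keeps header lines as they are and drops every "-" from sequence lines. The function name and the example records are made up for illustration.

```python
def strip_gaps(aligned_fasta):
    """Remove alignment gap characters ("-") from the sequence lines of FASTA text."""
    out = []
    for line in aligned_fasta.splitlines():
        if line.startswith(">"):
            out.append(line)  # header lines are kept unchanged
        else:
            cleaned = line.replace("-", "")
            if cleaned:  # skip lines that contained only gaps
                out.append(cleaned)
    return "\n".join(out) + "\n"


# Made-up two-record alignment for illustration.
example = ">genomeA\nATG--CCA\n---TTG\n>genomeB\nATGAACCA\n------\n"
print(strip_gaps(example), end="")
```

For the real file you would read core_gene_alignment.aln, pass its contents through the function, and write the result to a new file for phylophlan_setup_database.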

I hope this helps and let me know if something doesn’t work.

Many thanks,

Yes, that helps. Thank you.

I have a big problem now. Since I installed Prokka and Roary, PhyloPhlAn is no longer working on my Mac (it was working before I installed these programs). I get an error at the mapping stage using diamond: it said something about the database and the version not being compatible.

[e] Command '['/Users/MyComputer/miniconda3/envs/phylophlan/bin/diamond', 'blastx', '--quiet', '--threads', '1', '--outfmt', '6', '--more-sensitive', '--id', '50', '--max-hsps', '35', '-k', '0', '--query', 'phylophlan_output/tmp/clean_dna/VICT1.fasta', '--db', 'phylophlan_databases/phylophlan/phylophlan.dmnd', '--out', 'phylophlan_output/tmp/map_dna/VICT1.b6o.bkp']' returned non-zero exit status 1.

[e] gene_markers_identification crashed

Thank you,

Likely the diamond version changed. What you can do is remove the diamond indexed database (the file with the .dmnd extension) from the PhyloPhlAn databases folder and re-run PhyloPhlAn. At that point, PhyloPhlAn will re-index the database using the new version you installed and everything should work.
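
If you prefer to script the cleanup, something along these lines should do it. It is a small sketch, not part of PhyloPhlAn: the function name and the dry_run switch are my own, and the folder name is taken from the path in your error message, so adjust it if your databases live elsewhere.

```python
from pathlib import Path


def remove_diamond_indexes(db_folder, dry_run=True):
    """List (and, if dry_run is False, delete) .dmnd files under db_folder."""
    targets = sorted(Path(db_folder).rglob("*.dmnd"))
    if not dry_run:
        for path in targets:
            path.unlink()
    return targets


# Dry run first: only report what would be removed.
for path in remove_diamond_indexes("phylophlan_databases"):
    print("would remove:", path)
```

Once the listing looks right, call the function again with dry_run=False and then re-run PhyloPhlAn so the index is rebuilt.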

Many thanks, Francesco