Hello. I am trying to create a set of core proteins to use as a database instead of UniRef90, because the species I am studying is not there. I’ve already used the default PhyloPhlAn database, but I want to make my own. How can I do this?
Hello and thanks for using PhyloPhlAn!
To build your own database you can try following the instructions available here: PhyloPhlAn wiki - Database setup.
Basically, you’ll still need to use the phylophlan_setup_database script, but providing your own file (or folder with the gene files) instead of the automatic download of UniRef90.
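As a rough sketch of what that call could look like, the snippet below only assembles the command line and prints it. The flags used (-i for the input file or folder, -d for the database name, -t for the database type, -e for the input file extension) are written from memory of the script's help text, so treat them as assumptions and check phylophlan_setup_database --help first. The folder name core_proteins and the database name my_species_db are made up for illustration.

```python
import shlex


def setup_database_command(input_path, db_name, db_type="a", extension=None):
    """Build (but do not run) a phylophlan_setup_database command line.

    Assumed flags, to be verified against --help:
      -i  input file or folder with the marker sequences
      -d  name of the database to create
      -t  database type: "a" for amino acids, "n" for nucleotides
      -e  extension of the input files, when the input is a folder
    """
    if db_type not in ("a", "n"):
        raise ValueError("db_type must be 'a' (amino acids) or 'n' (nucleotides)")
    cmd = ["phylophlan_setup_database", "-i", str(input_path),
           "-d", db_name, "-t", db_type]
    if extension:
        cmd += ["-e", extension]
    return cmd


if __name__ == "__main__":
    # Hypothetical folder of protein FASTA files and a hypothetical db name.
    cmd = setup_database_command("core_proteins", "my_species_db", extension=".faa")
    print(shlex.join(cmd))
```

Once the printed command looks right for your setup, you can paste it into a terminal or pass the list to subprocess.run.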
Please let me know if something is not clear.
Hi! I meant how can I get a set of genes that are markers to put in the database.
Hi, to do that you can use tools like Prokka and Roary: the first annotates your genomes and the second computes the set of core genes from those annotations. Then you can build a custom db for PhyloPhlAn using the core genes identified by Roary.
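To make the two steps a bit more concrete, here is a minimal sketch that builds one Prokka command per genome and one Roary command over the resulting GFF files. It only constructs and prints the commands. The genome file names, the output folder names, and the 99% core definition are illustrative assumptions; the options used are Prokka's --outdir, --prefix and --cpus, and Roary's -f (output directory), -p (threads) and -cd (percentage of isolates a gene must be in to count as core).

```python
import shlex
from pathlib import Path


def prokka_command(genome, out_root, cpus=4):
    """Command to annotate one genome; output goes to <out_root>/<sample>/."""
    sample = Path(genome).stem
    return ["prokka", "--outdir", str(Path(out_root) / sample),
            "--prefix", sample, "--cpus", str(cpus), str(genome)]


def roary_command(gff_files, out_dir, cpus=4, core_pct=99):
    """Command to compute the pan/core genome from Prokka GFF files."""
    return ["roary", "-f", str(out_dir), "-p", str(cpus),
            "-cd", str(core_pct), *[str(g) for g in gff_files]]


if __name__ == "__main__":
    # Hypothetical input genomes.
    genomes = ["genomes/strainA.fna", "genomes/strainB.fna"]
    for g in genomes:
        print(shlex.join(prokka_command(g, "annotation")))
    # Prokka writes <prefix>.gff into each output folder.
    gffs = [Path("annotation") / Path(g).stem / (Path(g).stem + ".gff")
            for g in genomes]
    print(shlex.join(roary_command(gffs, "roary_out")))
```

The sequences of the genes Roary reports as core would then still need to be collected into a file or folder to give to phylophlan_setup_database.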
I hope this helps, thanks,
Hello Francesco. Yes that helps, I just read this in the paper too.
Right now I am using the default phylophlan database, but would you agree it would make a “better” tree to make a custom db of markers, if looking at a single species?
Hello Ana. Yes, the phylophlan database is a set of 400 universal proteins, so they might not be specific enough to accurately resolve closely related genomes, as in your case.
I don’t know what species you’re studying, but as an alternative to the “prokka+roary” pipeline, one thing you could try is to download the UniRef90 of the species in the same genus as yours, make a db for PhyloPhlAn from them, and then set the --min_num_entries param in PhyloPhlAn to use only those markers that are found in “enough” genomes (basically this will be a coreness threshold for the markers in the db).
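To illustrate what a coreness threshold does, here is a small conceptual sketch. It is not PhyloPhlAn's actual code, and the marker and genome names are made up. It keeps only the markers that were found in at least a given number of input genomes, which is the effect I am describing for --min_num_entries.

```python
def filter_markers(presence, min_num_entries):
    """Return the markers found in at least `min_num_entries` genomes.

    `presence` maps a marker name to the collection of genomes where it
    was detected. This mimics a coreness threshold conceptually and is
    not taken from the PhyloPhlAn source.
    """
    return sorted(marker for marker, genomes in presence.items()
                  if len(set(genomes)) >= min_num_entries)


if __name__ == "__main__":
    # Hypothetical detection results for three markers in three genomes.
    presence = {
        "markerA": {"g1", "g2", "g3"},
        "markerB": {"g1"},
        "markerC": {"g2", "g3"},
    }
    print(filter_markers(presence, 2))
```

A higher threshold leaves fewer markers, but each one is present in more of your genomes; a lower one keeps more markers at the cost of more missing data per marker.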