Hey,
I’ve installed PhyloPhlAn via conda (the latest version, which was uploaded on March 31, 2020).
When I check the version by executing phylophlan -v it prints:
PhyloPhlAn version 0.43 (2 March 2020)
unlike the example in the installation guide (PhyloPhlAn version 3.0 (1 April 2020)).
When I check the version with conda list, it does indeed seem to be PhyloPhlAn 3:
phylophlan version:3.0 build: py_1
So I’m not sure whether the output of phylophlan -v simply was not updated or whether I have an older version.
Which version do I actually have, 0.43 or 3.0? If it is an older one, do you know when the newer one will be available on conda?
You indeed installed PhyloPhlAn 3.0.
The 0.43 you got from phylophlan -v is the internal version for each of the PhyloPhlAn scripts.
These internal versions will likely be updated when PhyloPhlAn 3.1 is released, but we don’t have an ETA at the moment.
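To make the difference between the two numbers explicit, here is a minimal Python sketch that parses a banner in the format quoted above. The function name parse_version_banner is hypothetical (it is not part of PhyloPhlAn), and the banner pattern is assumed only from the two examples in this thread.

```python
import re

# Pattern assumed from the two banners quoted in this thread, e.g.
# "PhyloPhlAn version 0.43 (2 March 2020)".
BANNER_RE = re.compile(r"PhyloPhlAn version (\S+) \(([^)]+)\)")


def parse_version_banner(banner):
    """Return (version, date) from a banner line, or None if it does not match."""
    match = BANNER_RE.search(banner)
    if match is None:
        return None
    return match.group(1), match.group(2)


# Values copied from the thread: the banner printed by "phylophlan -v"
# and the package version reported by "conda list".
script_banner = "PhyloPhlAn version 0.43 (2 March 2020)"
package_version = "3.0"

script_version, script_date = parse_version_banner(script_banner)
# The two numbers track different things, so a mismatch is expected here.
print(f"script-internal version: {script_version} ({script_date})")
print(f"conda package version:   {package_version}")
```

The package version from conda list is the one that tells you which PhyloPhlAn release is installed.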
Thanks Francesco,
I’m following the tree of life tutorial in order to build a phylogenetic tree with known bacterial genomes in addition to several genomes I’ve reconstructed from metagenomes.
Can you please explain which steps are performed to build the tree in this example? I see that there are several alignment steps, and from the config file I see that these include mapping of amino acids, but it would be great to understand the full pipeline step by step.
Is it possible to start the process from the middle? For example, could I pause the run after the DNA mapping and restart it using the files in tmp/map_dna, without re-mapping the DNA?
You can refer to the configuration file to understand the steps performed, which in brief are:
amino acids database creation using diamond ([db_aa] section)
mapping genomes using diamond blastx ([map_dna] section)
mapping proteomes using diamond blastp ([map_aa] section)
multiple-sequence alignments using mafft ([msa] section)
multiple-sequence alignments trimming using trimal ([trim] section)
phylogeny reconstruction using iqtree ([tree1] section)
Of course, in addition to the above steps, there are other internal PhyloPhlAn steps controlled by the --diversity high --fast combination of parameters, which are described in the PhyloPhlAn user manual under the Accurate or Fast section.
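If you want to list the configured steps without reading the file by eye, here is a minimal sketch using Python's configparser. The embedded text is a hypothetical, stripped-down stand-in for a real configuration file: only the section names and tool names come from the list above, and the program_name key is assumed here purely for illustration.

```python
import configparser

# Hypothetical, stripped-down stand-in for a PhyloPhlAn configuration file.
# Section names and tools are taken from the list above; the "program_name"
# key is an assumption used only for this illustration.
EXAMPLE_CFG = """
[db_aa]
program_name = diamond

[map_dna]
program_name = diamond

[map_aa]
program_name = diamond

[msa]
program_name = mafft

[trim]
program_name = trimal

[tree1]
program_name = iqtree
"""


def list_steps(cfg_text):
    """Return (section, program) pairs in the order they appear in the file."""
    parser = configparser.ConfigParser()
    parser.read_string(cfg_text)
    return [(name, parser[name].get("program_name", "?"))
            for name in parser.sections()]


for section, program in list_steps(EXAMPLE_CFG):
    print(f"[{section}] -> {program}")
```

To inspect your own file, read its text from disk and pass it to list_steps instead of EXAMPLE_CFG.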
Indeed, it is possible to stop the PhyloPhlAn execution and restart it from where it left off. When doing this, however, you should check that there are no incomplete temporary files, as these might cause crashes when the analysis restarts.
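As a rough aid for that check, here is a minimal Python sketch that lists zero-byte files under a folder. This is only a heuristic and an assumption on my side: a file that was cut off midway is usually not empty, so it will not be caught this way, and the function find_empty_files is hypothetical rather than something PhyloPhlAn provides.

```python
import os


def find_empty_files(folder):
    """Return the sorted paths of all zero-byte files under 'folder'."""
    empty = []
    for root, _dirs, files in os.walk(folder):
        for name in files:
            path = os.path.join(root, name)
            # Zero size is only a hint that a step was interrupted;
            # truncated but non-empty files are NOT detected here.
            if os.path.getsize(path) == 0:
                empty.append(path)
    return sorted(empty)
```

You could call it on the temporary folder of the run, for example find_empty_files("tmp"), and inspect or delete the reported files before restarting.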
Thanks for the detailed response Francesco!
Actually, I have several genomes from different species (both known and unknown), so I found this specific tutorial useful for understanding the phylogenetic relationships of my MAGs across the bacterial tree of life.
As for 1: why are there two alignment steps, first mapping the genomes with blastx and then with blastp? Is there a process for identifying marker genes?
As for 2: just by executing it again with the same parameters? Will it look for existing tmp files instead of re-creating them?
Lastly, the PhyloPhlAn database is downloaded to the new directory. Is there a way to specify its location to avoid this step?
Hi, yes, then it makes sense to place them in the tree of life to further characterize them. Alternatively, you can use phylophlan_metagenomic to see whether your unknown genomes and MAGs are assigned to any SGB. Here are the SGB paper and the tutorial and manual for phylophlan_metagenomic.
As for your questions:
The two mapping steps are there in case your inputs contain both genomes and proteomes. So, if you only have genomes in the input folder, only the map_dna part will be used.
Yes, just re-run the same PhyloPhlAn command; PhyloPhlAn will detect the presence of the output and tmp folders, check what is still missing, and recompute only the missing steps.
As I wrote above, please make sure you don’t have any incomplete files, as these may cause PhyloPhlAn to crash.
Yes, you can specify the location of the PhyloPhlAn databases using the --databases_folder parameter. Please have a look at the user manual if you need more details.
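For illustration, here is a minimal Python sketch that assembles such a command line. The options --diversity, --fast and --databases_folder are the ones discussed in this thread; -i, -d and -f (input folder, database name, configuration file) are not mentioned above and should be checked against phylophlan -h. All folder and file names, and the helper build_phylophlan_command itself, are hypothetical placeholders.

```python
import shlex


def build_phylophlan_command(input_folder, config_file, databases_folder,
                             database="phylophlan"):
    """Assemble the argument list for a PhyloPhlAn run (nothing is executed)."""
    return [
        "phylophlan",
        "-i", input_folder,      # assumed: folder with input genomes/proteomes
        "-d", database,          # assumed: name of the marker database
        "-f", config_file,       # assumed: path to the configuration file
        "--diversity", "high",
        "--fast",
        # Point PhyloPhlAn to an existing databases location so that the
        # database does not have to be downloaded into each new directory.
        "--databases_folder", databases_folder,
    ]


# Placeholder paths; replace them with your own.
cmd = build_phylophlan_command("input_genomes", "my_config.cfg",
                               "/path/to/phylophlan_databases")
print(" ".join(shlex.quote(arg) for arg in cmd))
```

Re-running the same printed command later is also how the restart described above would be triggered.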
the two mapping steps are there in case your inputs contain both genomes and proteomes. So, if you only have genomes in the input folder, only the map_dna part will be used.
So this is actually the main point I’m trying to understand. The inputs in the tree of life tutorial are genomes, so why is mapping to amino acids done there?
If a section is defined in the config file, but the criteria are not met, it is simply skipped.
So, if you only have genomes and don’t have proteomes, the map_aa section will not be used, and only one mapping (the map_dna) will be performed.
This, however, also depends on the PhyloPhlAn parameters (namely, whether or not --force_nucleotides is specified). If your database is made of proteins and your inputs are genomes, PhyloPhlAn will first map the genomes, translate the mapped portions into protein sequences, and re-map these temporary proteomes against the database. This is the case for tree of life phylogenies, as the phylogenetic signal in amino acids is more robust than in nucleotides.
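To summarize the rule, here is a small Python sketch of the decision as described in this reply. It is a simplified model of the explanation above, not PhyloPhlAn's actual code, and the function mapping_steps and its argument names are hypothetical.

```python
def mapping_steps(has_genomes, has_proteomes, db_is_protein,
                  force_nucleotides=False):
    """Return which mapping sections would be used, per the description above."""
    steps = []
    if has_genomes:
        # Genomes are always mapped first with the [map_dna] section.
        steps.append("map_dna")
    # Genomes mapped against a protein database are translated and
    # re-mapped, unless --force_nucleotides is given.
    remap_translated = has_genomes and db_is_protein and not force_nucleotides
    if has_proteomes or remap_translated:
        steps.append("map_aa")
    return steps


# Tree of life case: genome inputs, protein database, no --force_nucleotides.
print(mapping_steps(has_genomes=True, has_proteomes=False, db_is_protein=True))
```

With --force_nucleotides, the same inputs would go through map_dna only.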