I am trying to build a phylogenetic tree for the MAGs generated from my metagenomic datasets. The command I used is:
phylophlan --input PHYLO_IN/ -d phylophlan -t a --databases_folder PHYLO_DB/ --diversity high --output_folder PHYLO_OUT -f PHYLO_OUT/supermatrix_aa.cfg --genome_extension fasta
and the output files I got are:
PHYLO_IN.tre, PHYLO_IN_resolved.tre, RAxML_bestTree.PHYLO_IN_refined.tre, RAxML_info.PHYLO_IN_refined.tre, RAxML_log.PHYLO_IN_refined.tre, RAxML_result.PHYLO_IN_refined.tre
What is the difference between these .tre files?
I tried to visualize the tree in the iTOL web tool, but it does not show any taxonomic information for the bins. How do I get that?
You can find a description of the outputs in the documentation here.
In brief, in the example above, RAxML_bestTree.PHYLO_IN_refined.tre should be your final phylogeny.
The tree itself will not give you taxonomic information unless it also contains other genomes with a known taxonomic label assigned to them.
Alternatively, you can run phylophlan_metagenomic, which reports the closest species-level genome bins (SGBs) to your input MAGs, so that you can tell whether each MAG belongs to an already existing SGB.
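To see why the tree carries no taxonomy, it can help to list its leaf labels: they are only the names of the input bins. Here is a minimal Python sketch, assuming a plain Newick file without quoted labels. The helper name leaf_names and the toy tree are illustrative and not part of PhyloPhlAn.

```python
import re

def leaf_names(newick):
    """Return leaf labels from a simple Newick string (no quoted labels)."""
    # A leaf label directly follows "(" or ","; internal node labels
    # (e.g. support values) follow ")" and are therefore skipped.
    return re.findall(r"[(,]\s*([^\s(),:;]+)", newick)

# Toy tree standing in for the contents of RAxML_bestTree.PHYLO_IN_refined.tre
example = "((bin.1:0.12,bin.2:0.08):0.30,bin.3:0.45);"
print(leaf_names(example))
```

For a real tree, read the file contents into a string first and pass that to leaf_names.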
Thanks for all the responses you made earlier to all the questions I had. Now I have a few more.
I want to confirm that I am doing everything correctly, so here are all the steps I followed to build the tree:
installed phylophlan
downloaded phylophlan config files
tree command:
phylophlan --input /lustre/rsharma/PHYLO_ANALYSIS/ALL/ -d phylophlan -t a --databases_folder /lustre/rsharma/PHYLO_ANALYSIS/PHYLO_DB/ --diversity high --output_folder /lustre/rsharma/PHYLO_ANALYSIS/OUTPUT/ -f /lustre/rsharma/PHYLO_ANALYSIS/OUTPUT/supermatrix_aa.cfg --genome_extension .fa --force_nucleotides --nproc 50
My MAGs are nucleotide FASTA files (.fa), but the config file I am using is the “aa” one. Is that all right, or do I need to change anything? When I tried to build the database with the “nt” config file, it failed with an error, so I tried the “aa” one and it started to run.
Another question:
Can I use another tool, such as GTDB-Tk, to obtain the taxonomy and then use iTOL to label the phylogenetic tree with the species? Is this a correct way of visualizing both the phylogeny and the taxonomy of the MAGs? If so, which diversity level should I choose (low, medium, or high) when building the phylogenetic tree?
I’ll go through your points one by one so that it is clear which one I’m answering.
Instead of downloading config files, you can use PhyloPhlAn to generate the one you need, which ensures that it matches the tools as they are installed on your system.
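As a rough illustration of generating the config locally, the sketch below assembles a phylophlan_write_config_file call in Python and prints it. The flag names and tool choices (diamond, mafft, trimal, fasttree, raxml) follow the examples in the PhyloPhlAn wiki as I recall them, so treat them as assumptions and check phylophlan_write_config_file --help on your installation. The helper name write_config_cmd is hypothetical.

```python
import shlex

def write_config_cmd(out_cfg):
    """Build the argument list for an amino-acid (aa) supermatrix config."""
    return [
        "phylophlan_write_config_file",
        "-o", out_cfg,           # config file to create
        "-d", "a",               # database type: amino acids
        "--db_aa", "diamond",    # tool used to index the protein database
        "--map_dna", "diamond",  # translated search for genome (nucleotide) inputs
        "--map_aa", "diamond",   # protein search for proteome inputs
        "--msa", "mafft",        # multiple-sequence alignment
        "--trim", "trimal",      # alignment trimming
        "--tree1", "fasttree",   # first tree
        "--tree2", "raxml",      # refinement of the first tree
    ]

cmd = write_config_cmd("supermatrix_aa.cfg")
# Print the command so it can be reviewed before running it in a shell
print(shlex.join(cmd))
```

Each listed tool has to be installed and on your PATH; to execute the command from Python instead of copying it, subprocess.run(cmd, check=True) would work.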
Your inputs can be both genomes and proteomes, so that is not a problem at all. The aa (or nt) in the config file name is just a convenient way for me to identify configs tailored to gene/nucleotide (nt) or protein/amino acid (aa) databases. Only protein databases support the translated search, which is why they can handle both genomes and proteomes as input.
Of course, you can use any other tool, such as the one you mentioned. Alternatively, you can use phylophlan_metagenomic, which reports the closest SGB found for each MAG; you can use this information to taxonomically characterize your MAGs. For visualization, we have GraPhlAn within bioBakery, which is very flexible but requires a bit of scripting to get colorful, annotated figures.
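If you go the GTDB-Tk plus iTOL route, one possible way to attach the taxonomy is an iTOL LABELS annotation file. The Python sketch below is a minimal example under a few assumptions: the GTDB-Tk summary is a TSV with user_genome and classification columns, the genome IDs match the leaf names in the tree exactly, and the function name itol_labels and the toy rows are purely illustrative.

```python
import csv
import io

def itol_labels(gtdbtk_summary, rank_prefix="s__"):
    """Build iTOL LABELS annotation text from a GTDB-Tk summary TSV (file-like)."""
    lines = ["LABELS", "SEPARATOR TAB", "DATA"]
    for row in csv.DictReader(gtdbtk_summary, delimiter="\t"):
        genome = row["user_genome"]
        ranks = row["classification"].split(";")
        # Take the requested rank; an empty value (e.g. "s__") means unassigned
        taxon = next(
            (r[len(rank_prefix):] for r in ranks if r.startswith(rank_prefix)), ""
        )
        label = f"{genome} | {taxon}" if taxon else genome
        lines.append(f"{genome}\t{label}")
    return "\n".join(lines) + "\n"

# Toy input standing in for a GTDB-Tk summary file
toy = io.StringIO(
    "user_genome\tclassification\n"
    "bin.1\td__Bacteria;p__Bacteroidota;g__Bacteroides;s__Bacteroides fragilis\n"
    "bin.2\td__Bacteria;p__Bacteroidota;g__;s__\n"
)
print(itol_labels(toy))
```

Save the returned text to a file and drag it onto the tree in iTOL. MAGs without a species assignment keep their original bin name; passing rank_prefix="g__" would label by genus instead.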
In the example above you’re running --diversity high and --accurate (the default if not specified). You can find a bit more info about the available combinations here: Home · biobakery/phylophlan Wiki · GitHub. Some cutoffs differ between --accurate and --fast, but the difference probably won’t be dramatic in your case.
I think the more important choice is the diversity you expect among the MAGs that you want to phylogenetically characterize. If you expect low genomic diversity, medium and high may be too aggressive and could cut out a bit of the phylogenetic signal.