One of the things we used PhyloPhlAn 2 for was predicting the genus and species of new MAGs. This was a by-product of placing MAGs in the tree of life using the protein markers in the PhyloPhlAn database.
The new phylophlan_metagenomic script replaces this functionality, but it appears to work only with SGB databases. This gives us terrible resolution for our (rumen) MAGs: assignments often go only to phylum, sometimes to family, and very rarely to genus or species.
Is there a way of using phylophlan_metagenomic with the old protein biomarker database, so as to replicate the behaviour of PhyloPhlAn 2?
Alternatively, is there a way of asking PhyloPhlAn to output tabular taxonomy predictions when it places genomes in the tree of life?
Hello @BioMickWatson, I’m sorry, but that functionality is not available in PhyloPhlAn 3.0. The main reason we removed it is that it relied on the ability of MUSCLE to merge MSAs. During our tests we found several cases where the merged MSA was biased, probably because merging MSAs accurately is a difficult task. In addition, once we moved the external tool configuration of PhyloPhlAn into a config file, to support more tools and allow flexible parameters, we could no longer easily provide this functionality, both because of the number of external tools one can use and because of the many different configurations a user can set for an analysis.
Having said this, one can set up an analysis that replicates what PhyloPhlAn 2 did when integrating inputs. This would require the following steps:
1. Retrieve a set of reference genomes to cover the diversity (this can be done using ``).
2. Use PhyloPhlAn 3.0 to build a phylogeny using the phylophlan database.
3. Create a new input folder and link into it all the reference genomes used in the tree-of-life phylogeny from the previous step, plus the new genomes and MAGs to be placed.
4. To save time, create the new output folder (say, output1) and copy into output1/tmp the folders map_dna and markers_dna, to avoid re-mapping and re-extracting the phylophlan database markers already computed in step 2.
5. Reconstruct a new tree of life with the new inputs. Since some data were copied from the previous tree of life, the very same parameters and configuration file should be used.
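The file handling in the linking and copying steps above could be scripted roughly as follows. This is only a sketch using the Python standard library: the function names and the folder names used with them are hypothetical, and it assumes the first run left map_dna and markers_dna under its tmp folder as described above. It does not run PhyloPhlAn itself.

```python
import shutil
from pathlib import Path


def link_genomes(sources, input_dir, extension=".fna"):
    """Symlink every genome with the given extension from the source
    folders (reference genomes, new MAGs) into one input folder."""
    input_dir = Path(input_dir)
    input_dir.mkdir(parents=True, exist_ok=True)
    for source in sources:
        for genome in sorted(Path(source).glob(f"*{extension}")):
            target = input_dir / genome.name
            # Skip names that are already present in the input folder.
            if target.is_symlink() or target.exists():
                continue
            target.symlink_to(genome.resolve())
    return sorted(input_dir.iterdir())


def reuse_tmp_folders(old_output, new_output, folders=("map_dna", "markers_dna")):
    """Copy intermediate folders from the previous run into the tmp
    folder of the new output, so they are not recomputed."""
    new_tmp = Path(new_output) / "tmp"
    new_tmp.mkdir(parents=True, exist_ok=True)
    for name in folders:
        src = Path(old_output) / "tmp" / name
        if src.is_dir():
            # dirs_exist_ok requires Python 3.8 or newer.
            shutil.copytree(src, new_tmp / name, dirs_exist_ok=True)
    return new_tmp
```

For example, `link_genomes(["references", "mags"], "input1")` followed by `reuse_tmp_folders("output0", "output1")` would prepare the folders for the second run, where all four folder names are placeholders for your own paths.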
I’ll be happy to further help with this if something is not clear.
1. I want to know how to retrieve the reference genomes if we don’t have taxonomic information about our MAGs. What command should I use? Does it download references for all the genomes in the database?
2. When building the phylogeny, what should we put in the input folder? Just the MAGs in .fasta format, or the downloaded reference genomes as well? And should I keep both the reference genomes and the MAGs in the same input folder?
3. When should I create the new output folder? I mean, which command should I give the new output folder to when running?
4. What is the command for the tree of life?
I am really confused because none of the tutorials has the exact commands I need, and I find them very ambiguous. I have opened new topics; please help me if you can understand my problem. I want to get a graph like the one I have posted below, with the taxonomic labels for each bin as well as their phylogenetic tree.
Hi @saras22, I’ll try to answer your questions, please do let me know if something is still not clear.
1. Ok, there are two options here, in my opinion. In the first, you use phylophlan_metagenomic to assign your MAGs to existing SGBs. If they are assigned to kSGBs, you know the species label, so you know which reference genomes to download. If your MAGs are assigned to uSGBs, the taxonomic label of the SGB can have three different levels of assignment, giving three scenarios:
genus: you know the genus of the SGB your MAGs are assigned to but not the species, so you can use phylophlan_get_reference to download up to a certain number of reference genomes (say, 10) for every species in that genus;
family: as in the genus case, except that you know only the family, so you can use phylophlan_get_reference to download reference genomes for all species and genera in that family;
phylum: this means the SGB is really unknown, so we pick the phylum of the closest reference genomes. Here you would probably want to explore the phylogenetic placement of your MAGs within a tree of life (the second option below).
In the second option, you can build a tree of life with your MAGs and use the phylogenetic placement to infer the taxonomic assignment (see the answer to 4. for this).
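To make the decision logic above explicit, here is a small sketch that maps an assignment to the strategy described. The function name, its arguments, and the returned strings are all hypothetical; it does not parse real phylophlan_metagenomic output and does not call phylophlan_get_reference.

```python
def reference_strategy(is_known_sgb, level=None):
    """Return the suggested next step for a MAG, given whether it was
    assigned to a kSGB and, for uSGBs, the taxonomic level of the label."""
    if is_known_sgb:
        # kSGB: the species label is known.
        return "download reference genomes for the assigned species"
    strategies = {
        "genus": "download a capped number of references for every species in the genus",
        "family": "download references for all genera and species in the family",
        "phylum": "place the MAG in a tree of life and infer taxonomy from its placement",
    }
    key = (level or "").lower()
    if key not in strategies:
        raise ValueError(f"unexpected assignment level: {level!r}")
    return strategies[key]
```

The point is only that genus-level and family-level labels still tell you which references to fetch, while a phylum-level label sends you to the tree-of-life route.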
2. The input folder should contain everything you would like to have in your tree. Here is a description of the input files.
The main message is that if you want a tree with both MAGs and reference genomes, you should put them in the same input folder. Note that PhyloPhlAn uses the file extension to distinguish genomes from proteomes (both are allowed in the input folder), so your MAGs and reference genomes must share the same extension (.fna, .fa, .fasta, etc.). Which one you use does not matter, as long as it is the same for all of them. The default is .fna; if you use a different one, you can specify it with the --genome_extension param.
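If your MAGs and reference genomes currently have mixed extensions, something like the following could make them consistent before running PhyloPhlAn. It is a minimal sketch: the function name and the set of extensions it looks for are my own assumptions, and it renames files in place, so try it on a copy first.

```python
from pathlib import Path

# Extensions treated as genome FASTA files in this sketch (an assumption).
GENOME_EXTENSIONS = {".fna", ".fa", ".fasta"}


def unify_extension(folder, target=".fna"):
    """Rename genome files in `folder` so they all end with `target`."""
    renamed = []
    for path in sorted(Path(folder).iterdir()):
        if not path.is_file() or path.suffix not in GENOME_EXTENSIONS:
            continue  # leave proteomes and other files untouched
        if path.suffix == target:
            continue  # already has the wanted extension
        new_path = path.with_suffix(target)
        if new_path.exists():
            raise FileExistsError(f"{new_path} already exists")
        path.rename(new_path)
        renamed.append(new_path)
    return renamed
```

Alternatively, you can leave the files as they are and pass the shared extension with --genome_extension, as long as every genome in the folder uses that same extension.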
3. Apologies, but I don’t understand what you’re referring to here. Can you give an example?