Clades missing in phylogenetic tree

I had 8 bins and downloaded 12 closely related genomes from GenBank, then ran PhyloPhlAn to construct a phylogenetic tree. It successfully generated the .tre file, but when I visualized the tree with iTOL, I got only a few clades (a few bins and a few genomes), whereas I was expecting all 20 (8+12).

What could be the reason? Please help!

Hi, one main reason could be the quality control applied to the markers mapped during the PhyloPhlAn analysis. Do you have the output of PhyloPhlAn? (I can help you check which inputs were discarded and for what reason.)
Two other things matter here: (i) which database you used and (ii), with a protein database (like -d phylophlan), whether you built the phylogeny using amino acids or nucleotides. If you have the full command line and/or the output from PhyloPhlAn (ideally with the --verbose param), we can see which parameters can be tuned to avoid discarding too many of your inputs.

phylophlan_output.txt (194.7 KB)

Please see attached the output from phylophlan run.

And I wrote the configuration file as below:

phylophlan_write_config_file -d a -o genomes.cfg --db_aa diamond --map_dna diamond --map_aa diamond --msa mafft --trim trimal --tree1 fasttree --tree2 raxml --verbose

Hi, thanks for the output and details.

Ten inputs were discarded because they didn’t reach the minimum of 100 universal markers, a threshold that is set automatically when using the phylophlan database:

Not enough markers (90/100) found in "output_2/tmp/map_dna/bin79.b6o.bz2"
Not enough markers (59/100) found in "output_2/tmp/map_dna/GCA017993255.1.b6o.bz2"
Not enough markers (84/100) found in "output_2/tmp/map_dna/GCA017116435.1.b6o.bz2"
Not enough markers (85/100) found in "output_2/tmp/map_dna/bin10.b6o.bz2"
Not enough markers (69/100) found in "output_2/tmp/map_dna/GCA003521755.1.b6o.bz2"
Not enough markers (57/100) found in "output_2/tmp/map_dna/GCA012513575.1.b6o.bz2"
Not enough markers (67/100) found in "output_2/tmp/map_dna/GCA022563395.1.b6o.bz2"
Not enough markers (91/100) found in "output_2/tmp/map_dna/GCA018902835.1.b6o.bz2"
Not enough markers (70/100) found in "output_2/tmp/map_dna/GCA019454465.1.b6o.bz2"
Not enough markers (75/100) found in "output_2/tmp/map_dna/GCA021414205.1.b6o.bz2"

Then another one is discarded later:

Not enough markers (91/100) found in "output_2/tmp/map_aa/GCA020429705.1.b6o.bz2"

This happens because your inputs are genomes and the phylophlan database contains proteins. Since you didn’t specify --force_nucleotides (in both the config and the PhyloPhlAn command), PhyloPhlAn re-annotates your inputs as proteomes. In this case, the proteome for GCA020429705.1 does not reach the threshold of 100 universal markers.

So, in the end, your tree will have 11 of the original 22 inputs.
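If you want to pull these discard messages out of a long log yourself, here is a minimal Python sketch. It only assumes the message format shown above ("Not enough markers (found/required) found in "path""); the function name parse_discarded and the sample_log variable are just illustrative, not part of PhyloPhlAn.

```python
import re
from pathlib import PurePosixPath

# Matches the discard messages quoted in this thread, e.g.
# Not enough markers (90/100) found in "output_2/tmp/map_dna/bin79.b6o.bz2"
DISCARD_RE = re.compile(r'Not enough markers \((\d+)/(\d+)\) found in "([^"]+)"')
SUFFIX = ".b6o.bz2"


def parse_discarded(log_text):
    """Return a list of (input_name, stage, found, required) tuples."""
    discarded = []
    for match in DISCARD_RE.finditer(log_text):
        found = int(match.group(1))
        required = int(match.group(2))
        path = PurePosixPath(match.group(3))
        name = path.name
        if name.endswith(SUFFIX):
            name = name[: -len(SUFFIX)]
        stage = path.parent.name  # "map_dna" or "map_aa" in the messages above
        discarded.append((name, stage, found, required))
    return discarded


# Two lines copied from the output quoted above.
sample_log = '''
Not enough markers (90/100) found in "output_2/tmp/map_dna/bin79.b6o.bz2"
Not enough markers (91/100) found in "output_2/tmp/map_aa/GCA020429705.1.b6o.bz2"
'''

for name, stage, found, required in parse_discarded(sample_log):
    print(f"{name}\t{stage}\t{found}/{required}")
```

To run it on a real log, read the file into a string first (for example the attached phylophlan_output.txt) and pass that string to parse_discarded.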

You can lower the --min_num_markers threshold by setting it on the command line, e.g. --min_num_markers 50. The lowest count reported above is 57, so this should allow you to retain all of your input genomes.
Be careful, though, to double-check the MSA, as lowering this too much can bias branch lengths due to too many missing markers.
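As a quick sanity check before re-running, you can test a candidate threshold against the marker counts from the messages above. This is only a sketch: it assumes an input is kept when its count is at least the threshold, and it reuses the counts as reported in the log (ten from the map_dna step, one from map_aa), which may differ between the two steps. The helper name still_discarded is hypothetical.

```python
# Marker counts copied from the discard messages quoted in this thread.
marker_counts = {
    "bin79": 90,
    "GCA017993255.1": 59,
    "GCA017116435.1": 84,
    "bin10": 85,
    "GCA003521755.1": 69,
    "GCA012513575.1": 57,
    "GCA022563395.1": 67,
    "GCA018902835.1": 91,
    "GCA019454465.1": 70,
    "GCA021414205.1": 75,
    "GCA020429705.1": 91,  # reported at the map_aa step
}


def still_discarded(counts, min_num_markers):
    """Names of inputs whose marker count is below the threshold (assumed rule)."""
    return sorted(name for name, n in counts.items() if n < min_num_markers)


for threshold in (100, 60, 50):
    dropped = still_discarded(marker_counts, threshold)
    print(f"--min_num_markers {threshold}: {len(dropped)} still discarded {dropped}")
```

With these counts, a threshold of 60 would still drop the two inputs with 57 and 59 markers, while 50 keeps all eleven.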

Many thanks,
Francesco

Yes, that works. I understand it now. Thanks a lot.