Can I specify which reference genomes to download in PhyloPhlAn?

Hello bioBakers!

I am using PhyloPhlAn v3.0.67 (24 August 2022) with a dataset of MAGs from enriched samples, and I am trying to place the retrieved genomes in a phylogeny against references. As far as I understand, the phylophlan_get_reference command (as shown in example 04) selects reference genomes at random. The resulting phylogeny places only a small fraction (~10%) of my test MAGs among the references and shows the others in two distinct clades entirely separate from the reference genome collection. Do you think this is because:

  1. The reference genomes downloaded might be biased in some way (geographically, lab strains vs. clinical, etc.)?
  2. The input MAGs have too many gaps to accurately infer their phylogenetic position (or does Phylophlan account for this, and if so, how?)

In case 1, is there some way I can specify a collection of references to download? Even if it means manually downloading reference genomes and specifying those as a reference within the program somehow?

Any advice would be greatly appreciated! :slight_smile:


Hello Archie,

With phylophlan_get_reference you can use the -g parameter to specify the taxonomic level (clade) from which you want to download the reference genomes. You can have a look at the tutorials to see some command examples, like: PhyloPhlAn 3.0: Example 01: S. aureus · biobakery/biobakery Wiki · GitHub.
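
As a rough illustration of the -g usage, here is a minimal Python sketch that assembles such a command and only runs it if you opt in. The species label follows the S. aureus tutorial; the output folder name and the number of genomes are placeholders I made up, and the -o and -n flags are written as I remember them from the tutorial, so please check them against `phylophlan_get_reference -h` for your version.

```python
import shutil
import subprocess


def build_get_reference_cmd(clade, output_dir, how_many):
    """Assemble the argument list for phylophlan_get_reference.

    clade      -- taxonomic label passed to -g (e.g. "s__Staphylococcus_aureus")
    output_dir -- folder for the downloaded genomes (hypothetical name below)
    how_many   -- number of genomes to request (illustrative value below)
    """
    return [
        "phylophlan_get_reference",
        "-g", clade,
        "-o", output_dir,
        "-n", str(how_many),
    ]


# Placeholder values: adjust the clade, folder, and count to your own case.
cmd = build_get_reference_cmd("s__Staphylococcus_aureus", "input_references", 50)
print(" ".join(cmd))

# Set to True to actually start the download (requires PhyloPhlAn on PATH).
RUN_DOWNLOAD = False
if RUN_DOWNLOAD and shutil.which(cmd[0]) is not None:
    subprocess.run(cmd, check=True)
```

With RUN_DOWNLOAD left at False the script only prints the command, so you can inspect it before anything is downloaded.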

The reference genomes should not be biased by any such characteristic, as they are sorted according to their ‘quality’ from NCBI (i.e., RefSeq, redundancy, etc.).

I hope this helps.