Diversity --high and --low parameter

Hello @f.asnicar !!

I have three questions on three different topics:

  • Topic 1: low and high diversity parameter:
    As mentioned in the tutorial, high diversity is suited to building a tree of life. In my case I have 2,736 bins and I want to visualize them in the tree of life, so should I set the diversity level to high or low?

  • Topic 2: difference between kSGBs and uSGBs:
    I have run the phylophlan_metagenomic analysis as per your suggestion, and it has given me taxonomic labels for all the bins. Some of them have been assigned to kSGBs and some to uSGBs (unknown). I want to know the difference between kSGBs and uSGBs. Can I say that the uSGBs are putative novel species?

  • Topic 3: How are the markers selected for the phylogenetic tree construction?
    I am trying to read the PhyloPhlAn paper, but I cannot clearly understand how the markers are selected for the tree construction. Are these markers based only on 16S rRNA genes, or on all the housekeeping genes?

I have tried to assign taxonomy to my 2,736 bins using GTDB-Tk, whose database uses 62k genomes as references and assigns taxonomy on the basis of ANI. Does PhyloPhlAn also use reference genomes for selecting the marker genes?

What I have understood is that PhyloPhlAn uses species-specific marker genes to assign a taxonomic label to each bin, and core marker genes to connect the bins to other genomes. Please correct me if I am wrong.

How many known species can be classified using the phylophlan_metagenomic analysis?

I am very thankful for your responses to my earlier questions, and I apologize for asking such silly questions every time.

Thanks :)

Hello @saras22,

Topic 1: low and high diversity parameter:
As mentioned in the tutorial, high diversity is suited to building a tree of life. In my case I have 2,736 bins and I want to visualize them in the tree of life, so should I set the diversity level to high or low?

There are two things here. One is the parameter combinations, which automatically set several parameters according to the expected diversity among the inputs. The other is the type of phylogeny you want to build. In particular, if you want to place your 2,736 bins in the tree of life, you need to download the reference genomes and build the tree of life together with your bins. In this case, I would strongly suggest keeping the diversity high.
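
As a rough sketch of what such a run could look like, the snippet below only assembles a `phylophlan` command line with the high-diversity preset; it does not execute anything. The input folder name, the configuration file name, and the CPU count are placeholders invented for illustration, and the options should be checked against `phylophlan --help` for the installed version.

```python
import shlex


def build_phylophlan_command(input_dir, config_file, diversity="high", nproc=4):
    """Assemble (but do not run) a phylophlan call for the given diversity preset."""
    if diversity not in ("low", "medium", "high"):
        raise ValueError(f"unexpected diversity level: {diversity}")
    return [
        "phylophlan",
        "-i", input_dir,           # folder with the bins plus the reference genomes
        "-d", "phylophlan",        # the universal marker database
        "--diversity", diversity,  # preset for the expected diversity of the inputs
        "-f", config_file,         # configuration file (placeholder name below)
        "--nproc", str(nproc),
    ]


# Hypothetical folder and config names, for illustration only.
cmd = build_phylophlan_command("bins_and_references", "supermatrix_aa.cfg")
print(shlex.join(cmd))
```

Printing the command first makes it easy to review the options before launching a long run on thousands of genomes.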

Topic 2: difference between kSGBs and uSGBs:
I have run the phylophlan_metagenomic analysis as per your suggestion, and it has given me taxonomic labels for all the bins. Some of them have been assigned to kSGBs and some to uSGBs (unknown). I want to know the difference between kSGBs and uSGBs. Can I say that the uSGBs are putative novel species?

uSGBs are clusters that contain only MAGs and no reference genomes, whereas kSGBs contain at least one reference genome. uSGBs therefore represent potential species that are not described by any reference genome.

Topic 3: How are the markers selected for the phylogenetic tree construction?
I am trying to read the PhyloPhlAn paper, but I cannot clearly understand how the markers are selected for the tree construction. Are these markers based only on 16S rRNA genes, or on all the housekeeping genes?

I think you’re referring to the 400 universal markers. They were selected in the first PhyloPhlAn paper and represent a set of genes that are conserved across bacteria.

I have tried to assign taxonomy to my 2,736 bins using GTDB-Tk, whose database uses 62k genomes as references and assigns taxonomy on the basis of ANI. Does PhyloPhlAn also use reference genomes for selecting the marker genes?

I think this question is linked to the previous one: the marker genes were originally selected from the available reference genomes. However, I’m not sure how the markers relate to a comparison of the taxonomic assignments from GTDB and PhyloPhlAn.

What I have understood is that PhyloPhlAn uses species-specific marker genes to assign a taxonomic label to each bin, and core marker genes to connect the bins to other genomes. Please correct me if I am wrong.

For the taxonomic assignment, PhyloPhlAn uses Mash sketch indexes built from all genomes and MAGs previously assigned to an SGB. So, it doesn’t use SGB-specific markers for this task.
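
To give a feel for distance-based assignment, here is a minimal sketch and not PhyloPhlAn's actual code. `mash_distance` uses the published Mash formula, D = -(1/k) ln(2j / (1 + j)), with Mash's default k-mer size of 21. `closest_sgb` is a hypothetical helper that assumes a bin is matched to the SGB with the smallest average distance; the SGB names and distances are made up.

```python
import math


def mash_distance(jaccard, k=21):
    """Mash distance from a Jaccard index j: D = -(1/k) * ln(2j / (1 + j))."""
    if not 0.0 <= jaccard <= 1.0:
        raise ValueError("Jaccard index must be between 0 and 1")
    if jaccard == 0.0:
        return 1.0  # no shared k-mers: treated as maximally distant
    distance = -math.log(2.0 * jaccard / (1.0 + jaccard)) / k
    # Clamp to [0, 1]; this clamping is a choice made for this sketch.
    return min(1.0, max(0.0, distance))


def closest_sgb(distances_by_sgb):
    """Hypothetical helper: pick the SGB with the smallest average distance.

    distances_by_sgb maps an SGB label to the list of distances between
    one bin and the genomes already assigned to that SGB.
    """
    averages = {
        sgb: sum(dists) / len(dists)
        for sgb, dists in distances_by_sgb.items()
        if dists
    }
    best = min(averages, key=averages.get)
    return best, averages[best]


# Made-up labels and distances, for illustration only.
example = {"kSGB_1": [0.02, 0.04], "uSGB_2": [0.20, 0.30]}
print(closest_sgb(example))
print(mash_distance(0.9))
```

The point is only that the comparison works on whole-genome sketches, so no marker genes are involved at this step.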

How many known species can be classified using the phylophlan_metagenomic analysis?

This depends on the release of the SGB database, as the addition of MAGs and reference genomes can slightly change the number of uSGBs and kSGBs. Roughly, though, about 1/3 of the SGBs are kSGBs and about 2/3 are uSGBs.
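
To see how your own bins split between known and unknown groups, a small sketch like the following could help. It assumes a tab-separated result file in which the second column starts with a label such as `kSGB_...` or `uSGB_...` followed by colon-separated fields; that layout and the example rows are assumptions made for illustration, so check them against your actual output.

```python
from collections import Counter


def count_known_unknown(lines):
    """Count bins whose closest group label starts with 'k' (known) or 'u' (unknown)."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and header/comment lines
        fields = line.split("\t")
        if len(fields) < 2:
            continue  # no assignment column on this line
        label = fields[1].split(":", 1)[0]  # e.g. "kSGB_0001"
        if label.startswith("k"):
            counts["known"] += 1
        elif label.startswith("u"):
            counts["unknown"] += 1
        else:
            counts["other"] += 1
    return counts


# Made-up rows in the assumed layout, for illustration only.
example_lines = [
    "#input_bin\tassignment",
    "bin.1\tkSGB_0001:Species:k__Bacteria|s__Example_species:0.02",
    "bin.2\tuSGB_0002:Species:k__Bacteria|s__Unnamed_species:0.03",
    "bin.3\tuSGB_0003:Species:k__Bacteria|s__Unnamed_species:0.04",
]
counts = count_known_unknown(example_lines)
print(dict(counts))
```

For a real file, the lines can be passed in directly from an open file handle, for example `count_known_unknown(open(path))`, where `path` is your result table.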

Many thanks,
Francesco