Is percentage_of_polymorphic_sites the same as the term polymorphic rates?
Is this calculated as number of polymorphic sites / number of all concatenated markers?
In strainphlan3, are the polymorphic sites marked as “-” in concatenated.aln and in the msas folder?
I got confused by the statement in the StrainPhlAn paper that “In the human gut, most species were represented by a single dominant strain because they show <0.1% of nucleotides on the species-specific markers that are polymorphic”. I guess this is not derived from percentage_of_polymorphic_sites, because in Figure 2C the percentage of strains with <0.1% of polymorphic sites has a wide distribution. So how was this conclusion (most species were represented by a single dominant strain) reached?
Could you please clarify? Thanks so much for the help.
Hi @fancyge
The percentage_of_polymorphic_sites is the percentage of bases in the consensus sequences that are considered polymorphic (i.e. by default, positions in which the allele dominance is below 80%). This is the same as the polymorphic sites percentage of Figure 2C in the original publication. In StrainPhlAn’s MSA, those positions will be masked as gaps (-).
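To make the definition concrete, here is a minimal Python sketch of the idea, not StrainPhlAn’s actual code. The function name, the input format (one string of observed bases per position) and the choice of covered positions as the denominator are all assumptions made for illustration; only the 80% dominance threshold comes from the description above.

```python
from collections import Counter


def percentage_of_polymorphic_sites(columns, dominance_threshold=0.8):
    """Percentage of covered positions whose dominant allele is below the threshold.

    `columns` is a hypothetical input: one string per position holding the
    bases observed in the reads at that position (empty string = uncovered).
    """
    covered = 0
    polymorphic = 0
    for column in columns:
        if not column:
            continue  # uncovered position: not counted at all (assumption)
        covered += 1
        # Count of the most frequent base at this position
        top_count = Counter(column).most_common(1)[0][1]
        if top_count / len(column) < dominance_threshold:
            polymorphic += 1
    return 100.0 * polymorphic / covered if covered else 0.0


# Hypothetical toy data: dominance 1.0, 0.7, 0.9, then one uncovered position
toy_columns = ["AAAAAAAAAA", "AAAAAAAGGG", "CCCCCCCCCT", ""]
toy_percentage = percentage_of_polymorphic_sites(toy_columns)
```

In the toy data only the second position (dominant allele at 70%) falls below the threshold, so one of the three covered positions counts as polymorphic.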
Thanks very much for the reply. I also wanted to ask about the statement “In the human gut, most species were represented by a single dominant strain because they show <0.1% of nucleotides on the species-specific markers that are polymorphic”. Can you tell me how this conclusion was reached? Is it based on each sample individually, or on all the samples taken as a whole, with the polymorphic sites calculated from the MSA?
In Figure 2C, the bottom part shows that the percentage of strains in each species with <0.1% of polymorphic sites ranged from 0% to 75.76%, from which I can’t see that species were represented by a single dominant strain.
Hi @fancyge
I was not part of the original work, but I understand that they are referring to Figure 1A, in which, accounting for all reconstructed markers in all samples assessed, less than 0.1% of all the sites were considered polymorphic (with a min. allele dominance threshold of 80% to be considered a polymorphic site).
I guess you mean Figure 2A, and a max. (rather than min.) allele dominance threshold of 80% for a site to be considered polymorphic. Is this correct?
Do you know how to deal with the “-” sites in the MSA when calculating the overall polymorphic rates?
Hi @fancyge
Yes, sorry, I meant Figure 2A and max. allele dominance.
For calculating the polymorphic rates, gaps and polymorphic positions are treated differently: if you inspect the pkl files output by sample2markers, you will notice that the polymorphic bases are masked with * while the gaps are masked with -. Polymorphic bases are afterwards turned into gaps before the MSA is built.
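A small Python sketch of that distinction, working on plain strings rather than on the real pkl contents (the sequences and function names below are made up for illustration): before the conversion, * and - can be counted separately; after it, they cannot be told apart.

```python
def summarize_consensus(seq):
    """Count polymorphic (*), gap (-) and called positions in a consensus string."""
    n_polymorphic = seq.count("*")
    n_gaps = seq.count("-")
    return {
        "polymorphic": n_polymorphic,
        "gaps": n_gaps,
        "called": len(seq) - n_polymorphic - n_gaps,
    }


def mask_polymorphic_as_gaps(seq):
    """Turn every polymorphic mark into a gap, as described for the pre-MSA step."""
    return seq.replace("*", "-")


# Two hypothetical consensus sequences with * and - in swapped positions
seq_a = "ACG*T-A"
seq_b = "ACG-T*A"

summary_a = summarize_consensus(seq_a)
masked_a = mask_polymorphic_as_gaps(seq_a)
masked_b = mask_polymorphic_as_gaps(seq_b)
```

Both toy sequences end up as the same masked string, which is why the masked form alone does not say which "-" used to be a polymorphic position.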
Thank you!
So the “-” in the final MSA covers both polymorphic bases and gaps introduced in each sample. In this case, to calculate the overall polymorphic rates in the MSA, should I ignore “-” sites or give them the same weight as A, C, T and G? Sorry, I didn’t find this information in the paper. I only found this: “To summarize the polymorphic site probabilities at the species level (thus marking the probabilities of multiple sites and markers), we define a polymorphic species as a species having a polymorphic rate greater than the median and standard deviation of the polymorphic site across samples, respectively.”, which is even more confusing to me. I can’t relate “a polymorphic species” to “species represented by a single dominant strain”. Initially, I wanted to check whether the species in my microbial communities were dominated by single strains and compare that with the gut results shown in this paper.
Thanks.
Hi @fancyge
The polymorphic rates are already pre-calculated by StrainPhlAn using the full set of markers available for the specific species in each sample. For that you can use the *.polymorphic files.
Yes, I know that I can get the polymorphic rates for each sample from the *.polymorphic file. However, I wanted to get the polymorphic rates across all the samples using the MSA, as in Figure 2A. I thought this was the criterion for telling whether a certain species is represented by a single dominant strain. Thanks.
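As a rough illustration of a per-sample summary, the sketch below takes per-sample percentages that have already been read from the *.polymorphic files and reports the fraction of samples under a cutoff. It does not parse the files (no particular file layout is assumed here), and the function name, the sample names and the values are all hypothetical; the 0.1% cutoff is simply the figure quoted from the paper earlier in this thread.

```python
def fraction_of_samples_below(per_sample_pct, cutoff=0.1):
    """Fraction of samples whose percentage of polymorphic sites is below `cutoff`.

    `per_sample_pct` maps a sample name to its percentage of polymorphic
    sites (in percent, so 0.1 means 0.1%).
    """
    values = list(per_sample_pct.values())
    if not values:
        return 0.0
    return sum(value < cutoff for value in values) / len(values)


# Hypothetical values for one species across four samples
example_pct = {"sample1": 0.05, "sample2": 0.20, "sample3": 0.00, "sample4": 1.50}
example_fraction = fraction_of_samples_below(example_pct)
```

Here two of the four made-up samples fall under the cutoff, so the fraction is 0.5.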
I would discourage you from trying to get the polymorphic rates from the MSA. The procedure to build the MSA has changed quite a lot since the first version of StrainPhlAn described in the original publication (the final MSA of StrainPhlAn is now a highly trimmed version of the full marker set). Moreover, since the process of transforming polymorphic positions into gaps is not reversible, it will be extremely difficult to know which positions of the MSA are actual polymorphisms, which are uncovered positions, and which are gaps introduced by the alignment.
I understand. That makes sense. I’ll give up on using the polymorphic rates from the MSA and focus on analyzing per sample instead. Thank you very much.