Question About StrainPhlAn3 Polymorphic Output File

Hi,

Could you please provide details on the columns/information presented in the clade.polymorphic file produced by StrainPhlAn3?

The column names appear to differ from those in the analogous file produced by StrainPhlAn2.

Thank you!
Rauf

Hi @raufs
The new polymorphic file contains the following columns per sample:

  • sample: the name of the sample
  • percentage_of_polymorphic_sites: the percentage of polymorphic sites computed over all markers concatenated together
  • avg_by_marker: the average percentage of polymorphic sites across all markers
  • median_by_marker: the median (Q2) percentage of polymorphic sites across all markers
  • std_by_marker: the standard deviation of the percentage of polymorphic sites across all markers
  • min_by_marker: the minimum percentage of polymorphic sites across all markers
  • max_by_marker: the maximum percentage of polymorphic sites across all markers
  • q25_by_marker: the first quartile (Q1) percentage of polymorphic sites across all markers
  • q75_by_marker: the third quartile (Q3) percentage of polymorphic sites across all markers
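
To make the difference between the concatenated value and the per-marker summaries concrete, here is a minimal Python sketch. It is not StrainPhlAn's code: the function name `summarize_markers`, the input layout (one `(polymorphic_sites, total_sites)` pair per marker), the use of the population standard deviation, and the quartile interpolation method are all assumptions made for illustration.

```python
import statistics


def summarize_markers(markers):
    """Summarize polymorphism for one sample.

    `markers` is a list of (polymorphic_sites, total_sites) pairs, one per
    marker. This layout is hypothetical and chosen only for the example.
    """
    # Concatenated view: pool all sites from all markers, then take one percentage.
    total_poly = sum(poly for poly, _ in markers)
    total_sites = sum(sites for _, sites in markers)
    concatenated = 100.0 * total_poly / total_sites

    # Per-marker view: one percentage per marker, then summary statistics.
    per_marker = [100.0 * poly / sites for poly, sites in markers]

    # Assumption: linear-interpolation quartiles ("inclusive" method).
    q25, _q50, q75 = statistics.quantiles(per_marker, n=4, method="inclusive")

    return {
        "percentage_of_polymorphic_sites": concatenated,
        "avg_by_marker": statistics.mean(per_marker),
        "median_by_marker": statistics.median(per_marker),
        # Assumption: population standard deviation; the tool may differ.
        "std_by_marker": statistics.pstdev(per_marker),
        "min_by_marker": min(per_marker),
        "max_by_marker": max(per_marker),
        "q25_by_marker": q25,
        "q75_by_marker": q75,
    }


# Four made-up markers of different lengths.
example = [(2, 100), (0, 200), (10, 100), (4, 400)]
for column, value in summarize_markers(example).items():
    print(f"{column}\t{value:.2f}")
```

With these made-up numbers the concatenated percentage is 2.0 while the per-marker average is 3.25, because the pooled value weights long markers more heavily than short ones.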

Best,
Aitor

Thank you for the column descriptions, Aitor!

And just to confirm: is a polymorphic site one where the sample differs from the reference marker’s sequence, or one where the base call within the sample is ambiguous, which could arise, for instance, when multiple strains are present in the metagenomic sample?

Kind regards,
Rauf

Hi @raufs
It is the second case. When reconstructing the marker sequences using CMSeq (https://github.com/SegataLab/cmseq), a polymorphic site is called if the frequency of the dominant allele is lower than 80%.
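
The 80% rule can be sketched in a few lines of Python. This is only an illustration of the threshold described above, not CMSeq's implementation: the function name `is_polymorphic`, the dictionary of per-base read counts, and the `dominant_threshold` parameter are hypothetical, and the real tool may apply additional filters (for example on coverage or base quality) before making the call.

```python
def is_polymorphic(base_counts, dominant_threshold=0.8):
    """Return True if the dominant allele frequency is below the threshold.

    `base_counts` maps each base to the number of reads supporting it at one
    position, e.g. {"A": 7, "G": 3}. Hypothetical input format.
    """
    total = sum(base_counts.values())
    if total == 0:
        # No reads cover this position, so no call can be made.
        return False
    dominant_frequency = max(base_counts.values()) / total
    return dominant_frequency < dominant_threshold


print(is_polymorphic({"A": 7, "G": 3}))   # True: dominant allele at 70%
print(is_polymorphic({"A": 9, "G": 1}))   # False: dominant allele at 90%
print(is_polymorphic({"A": 8, "G": 2}))   # False: exactly 80% is not lower than 80%
```

Under this reading, a site counts as polymorphic because reads within the same sample disagree, regardless of whether the dominant base matches the reference marker.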

Best,
Aitor

Great, thank you again, Aitor!

Much appreciated,
Rauf