Understanding StrainPhlAn for beginners

Hi Biobakery team,

I have read the StrainPhlAn paper quite a few times, but I still have a couple of lingering questions about StrainPhlAn 3 that I hope to check with the team directly. I apologize in advance for these two basic questions.

  1. Would it be correct to understand that StrainPhlAn outputs an MSA consisting of strain-specific consensus sequences, and that each strain-specific consensus sequence represents the most dominant strain of the bacterium of interest in a sample? If so, given enough reads for that bacterium in a sample, StrainPhlAn outputs one strain (the most dominant) for that one sample.

  2. Is it also correct to understand that the strain-specific consensus sequence is essentially the concatenation of the dominant markers found in a sample that satisfy the marker threshold (default: present in at least 80% of the samples containing the bacterium of interest)?


Hi @Jeffrey_Chiu
Answering your questions:

  1. You are right, the consensus sequence represents the most dominant strain (for the selected species) in the sample, and it is generated from the pile-up of the reads that map against the species-specific markers. However, from the first version of StrainPhlAn (StrainPhlAn paper) to StrainPhlAn 3, there are a few changes to the output MSA. The most important is that the StrainPhlAn 3 MSA only contains the “phylogenetically meaningful” positions of the consensus sequences. That is, positions that have at least 1% variability between the samples and less than 67% gaps (default parameters). So you could say that the output MSA consists of the strain-specific positions of the species-specific marker genes (marker genes = genes present in almost all the strains of the species but not in other species).
  2. For each sample, the consensus sequences are generated from the pile-up of the reads mapping against the marker genes. As different alleles can be found at each position, StrainPhlAn chooses the allele with >80% dominance (the dominant strain). Only consensus sequences with a breadth of coverage > 80% are kept. Then, for the MSA, by default, StrainPhlAn uses the consensus markers found in at least 80% of the samples, and the samples with at least 20 markers. After this filtering, StrainPhlAn selects the strain-specific positions of the consensus sequences as explained in #1.
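To make the two filtering steps above more concrete, here is a minimal Python sketch. It is not StrainPhlAn's actual code: the function names, the data layout (plain dicts), the order of the marker and sample filters, and the exact definition of "variability" (the fraction of non-gap bases that differ from the most common base in a column) are all assumptions made for illustration. Only the default thresholds (80% of samples, 20 markers, 1% variability, 67% gaps) come from the answer above.

```python
from collections import Counter

GAP = "-"  # assumption: only '-' is counted as a gap in this sketch


def filter_markers_and_samples(presence, marker_frac=0.80, min_markers=20):
    """presence maps sample name -> set of markers with a kept consensus.

    Hypothetical helper: keeps markers found in >= marker_frac of the samples,
    then keeps samples that still carry >= min_markers of those markers.
    The order of the two filters is an assumption.
    """
    n_samples = len(presence)
    counts = Counter(m for markers in presence.values() for m in markers)
    kept_markers = {m for m, c in counts.items() if c / n_samples >= marker_frac}
    kept_samples = {
        s for s, markers in presence.items()
        if len(markers & kept_markers) >= min_markers
    }
    return kept_markers, kept_samples


def keep_column(column, min_variability=0.01, max_gap_frac=0.67):
    """Decide whether one alignment column is 'phylogenetically meaningful'."""
    gap_frac = column.count(GAP) / len(column)
    if gap_frac >= max_gap_frac:  # needs less than 67% gaps
        return False
    bases = [b for b in column if b != GAP]
    top_count = Counter(bases).most_common(1)[0][1]
    variability = 1 - top_count / len(bases)  # assumed definition
    return variability >= min_variability


def trim_msa(msa):
    """msa maps sample name -> aligned sequence (all of equal length)."""
    names = list(msa)
    length = len(msa[names[0]])
    keep = [i for i in range(length)
            if keep_column([msa[s][i] for s in names])]
    return {s: "".join(msa[s][i] for i in keep) for s in names}


if __name__ == "__main__":
    toy = {"s1": "ACGT-", "s2": "ACGA-", "s3": "ACGT-", "s4": "ACG-A"}
    print(trim_msa(toy))  # {'s1': 'T', 's2': 'A', 's3': 'T', 's4': '-'}
```

In the toy alignment, the first three columns are dropped because they are identical across samples, and the last one because 75% of its entries are gaps, so only the fourth column survives.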

Hi Aitor,

Thanks so much for the prompt reply. This is very clear, and it is great to hear it directly from you. Following up on your answer to my second question, what if the alleles are evenly distributed? Are these alleles not considered or included in the MSA? For example, at a position A, the ratio of allele 1 to allele 2 is 1:1, with no clear dominance.

Hi @Jeffrey_Chiu, if that is the case, the position will be considered polymorphic and will be masked with an N before the MSA is computed.
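
As a rough illustration of the masking described here, the following Python sketch calls a consensus base per position from a pile-up. It is a simplified assumption of how such a call could work, not StrainPhlAn's implementation: `call_base` and `consensus` are hypothetical names, the pile-up is represented as one string of observed bases per position, and positions without any reads are written as `-` purely for this example.

```python
from collections import Counter


def call_base(pileup_bases, dominance=0.80):
    """Return the dominant base at one position, or 'N' if it is polymorphic.

    pileup_bases: string of bases observed in the reads covering the position.
    A base is called only if its frequency is strictly above `dominance`.
    """
    if not pileup_bases:
        return "-"  # assumption: uncovered positions are shown as gaps
    base, count = Counter(pileup_bases).most_common(1)[0]
    return base if count / len(pileup_bases) > dominance else "N"


def consensus(pileup, dominance=0.80):
    """Build a consensus string from a list of per-position pile-ups."""
    return "".join(call_base(column, dominance) for column in pileup)


if __name__ == "__main__":
    pileup = [
        "AAAAAAAAAT",  # 90% A: dominant allele is called
        "AAAAATTTTT",  # 1:1 ratio: no dominance, masked with N
        "",            # no coverage
    ]
    print(consensus(pileup))  # AN-
```

With a 1:1 allele ratio, neither allele passes the 80% dominance threshold, so the position comes out as `N`, matching the behaviour described in the reply.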

