I’m using strainphlan to identify how the same bacteria species from multiple hosts are related.
However, in the alignment, there are multiple "N"s. I’m trying to figure out what those N’s mean. One assumption is that there are multiple variants (SNPs) of the bacterial species within a host. Since the program selects one representative, it collapses all the variants into “N.” So the Ns are sites where multiple strains of the same bacterial species within a host vary.
OK, if the above is true, is it possible to extract all true variants within a species without collapsing? If the above isn’t true, can you explain the presence of “N”?
Hi @Binkybrown
Thanks for getting in touch,
The 'N's in the alignment can occur due 2 different scenarios:
A polymorphic position defined as a position with an allele dominance lower than 80%.
A position in the reconstructed marker that has not been covered by, at least, 8 reads.
As StrainPhlAn reconstruct the secuence of the dominant strain in the sample (in case there are more than one) those positions are converted to Ns in the case the dominant SNP is catching less than 80% of the reads. However, polymorphic positions can be also produced by sequencing errors.
Moreover, extracting “all the true variants within a species” is really tricky, since we cannot be sure which SNPs belong to which strain.
Thanks for the answers above. I seem to have run into the same issue, although I am still a bit confused how it happened given the two reasons you listed above. While I have a lot of Ns in my alignment output, I see very few Ns(or *s) in the .pkl file for consensus markers extracted from the metagenomes. I am also quite sure that >95% of the polymorphic positions (positions that are different from the reference markers) have major allele dominance >80% and are covered by more than 8 reads. I am thus quite confused how the Ns could have just appeared going from the consensus markers to the final alignment. I have also played around with the --marker_in_n_samples and the number of references/number of metagenome samples but no luck in getting rid of these Ns. Blasting the consensus markers against the final alignment also didn’t help (found almost no matching regions), which adds to my confusion because the alignments were supposed to be concatenations of the markers.
@cusoiv
Just for clarification, in StrainPhlAn, polymorphic positions are not “positions that are different from the reference markers” but just positions that have major allele dominance <80%. Those will be marked with * in the consensus markers.
Not being able to blast the consensus markers against the final alignment is something expected. Before concatenating all the markers in the final alignment, a quality control and trimming process is performed and, by default, only 500 of each marker are retained (you can see in the PhyloPhlAn tutorial all the operations performed for the “–diversity low + --fast” configuration: Home · biobakery/phylophlan Wiki · GitHub).
However, that you find more ‘N’ in your alignment that the ‘N’ + * in your markers is something unexpected. Could you provide me some examples?
Many thanks for the explanation! Here I provide an example of me having more “N” and “*” in the consensus markers than the “final alignment file”.
The consensus markers had 4926 "*"s and no Ns. The corresponding “>SRS058770_rep_core_selected” had 7648 Ns. Also the total length of the consensus makers was 496194 and the final alignment was only 33054, meaning that the "N"s were significantly enriched for when going from the consensus markers to the final alignment. I did swap the marker genes for the s__Parabacteroides_distasonis in the database with my own (and also edited the .pkl file according to the instructions on the metaphlan biobakery forum)
The code to go from the consensus markers to the final s__Parabacteroides_distasonis.StrainPhlAn3_concatenated.aln was
I am assuming that whether a position would eventually be called a ‘N’ be totally be dependent on the number of "*"s in the consensus markers since the polymorphism calling that you describe seem all be applied before the consensus markers are generated? I am also attaching the relevant files for the problem I describe above. I exported the consensus markers out of the .pkl file as a fasta since it may be easy to run into python version issues w/ the .pkl file. s__Parabacteroides_distasonis.StrainPhlAn3_concatenated.txt (197.0 KB) SRS058770_rep_core_selected.txt (508.1 KB)
Hi @cusoiv
Yes, you got it right, while the * represent polymorphic sites, that information is only used to calculate the polymorphic rates. After that, all the * positions (as well as the gaps) are transformed to Ns before the PhyloPhlAn execution.
Sorry for my mistake when I wrongly wrote “that you find more ‘N’ in your alignment that the ‘N’ + * in your markers is something unexpected” when it should be “that you find more ‘N’ in your alignment that the ‘-’ + * in your markers is something unexpected”.
Thanks a lot for the answer! I get it now; I think it is probably that not all positions in my marker genes were that well covered and I actually had a lot of ‘-’ s in my consensus markers that ended up as Ns. I tried using a smaller set of marker genes and the problem was resolved.