Many thanks for the explanation! Here is an example where the “final alignment file” contains more “N”s than the consensus markers contain “N”s and “*”s.
The consensus markers had 4926 "*"s and no Ns. The corresponding “>SRS058770_rep_core_selected” entry in the final alignment had 7648 Ns. Also, the total length of the consensus markers was 496194 while the final alignment was only 33054, meaning that ambiguous positions went from about 1% of the consensus markers to about 23% of the final alignment, so the "N"s were significantly enriched. I did swap the marker genes for s__Parabacteroides_distasonis in the database with my own (and also edited the .pkl file according to the instructions on the MetaPhlAn bioBakery forum).
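For anyone who wants to reproduce these counts, here is a minimal sketch of how they can be tallied from a FASTA file. The helper names (`read_fasta`, `count_symbols`, `fraction`) are made up for illustration and are not part of StrainPhlAn; the numbers in the final lines are the ones quoted above.

```python
from collections import Counter


def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def count_symbols(sequence, symbols=("N", "*", "-")):
    """Count ambiguous/gap symbols in one sequence (case-insensitive)."""
    counts = Counter(sequence.upper())
    return {s: counts.get(s, 0) for s in symbols}


def fraction(count, length):
    """Share of positions occupied by a symbol; 0.0 for an empty sequence."""
    return count / length if length else 0.0


if __name__ == "__main__":
    # Figures quoted in this post.
    print(f"'*' in consensus markers: {fraction(4926, 496194):.2%}")
    print(f"'N' in final alignment:   {fraction(7648, 33054):.2%}")
```

To run it on a real file, something like `for name, seq in read_fasta(open(path)): print(name, len(seq), count_symbols(seq))` should work, where `path` points at the exported FASTA or the concatenated alignment.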
The command I used to go from the consensus markers to the final s__Parabacteroides_distasonis.StrainPhlAn3_concatenated.aln was:
strainphlan -s consensus_markers/SRS058770_rep_core_selected.pkl -m PD_core_db/PD.rep.core.consensus.selected.fna -d PD_core_db/PD_rep_core_consensus_selected.pkl -r genomes/H111_Di.fa genomes/GUT_GENOME001075.fa genomes/GUT_GENOME001225.fa genomes/H199_Di.fa genomes/aa_0143_0055_c11_final.fa -o SRS058770_rep_output_core_selected_5g -n 8 -c s__Parabacteroides_distasonis
I am assuming that whether a position is eventually called an ‘N’ depends entirely on the number of "*"s in the consensus markers, since the polymorphism calling that you describe seems to be applied entirely before the consensus markers are generated? I am also attaching the relevant files for the problem I describe above. I exported the consensus markers out of the .pkl file as a FASTA, since it may be easy to run into Python version issues with the .pkl file.
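In case it helps, the export from .pkl to FASTA can be sketched roughly as below. This assumes the consensus-marker pickle is a list of dictionaries with `marker` and `sequence` keys, which may not hold for every StrainPhlAn version, so please check the structure first; `pkl_to_fasta` is a hypothetical helper name, not a StrainPhlAn function.

```python
import pickle


def pkl_to_fasta(pkl_path, fasta_path, width=60):
    """Write each consensus marker in a pickle as one FASTA record.

    ASSUMPTION: the pickle holds a list of dicts, each with a 'marker'
    (name) and a 'sequence' (string) key. Adjust the keys if your file
    is laid out differently.
    """
    with open(pkl_path, "rb") as handle:
        records = pickle.load(handle)

    with open(fasta_path, "w") as out:
        for rec in records:
            seq = str(rec["sequence"])
            out.write(f">{rec['marker']}\n")
            # Wrap the sequence at a fixed width for readability.
            for i in range(0, len(seq), width):
                out.write(seq[i:i + width] + "\n")

    return len(records)
```

The function returns the number of markers written, which is a quick sanity check against the number of markers expected for the clade.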
s__Parabacteroides_distasonis.StrainPhlAn3_concatenated.txt (197.0 KB)
SRS058770_rep_core_selected.txt (508.1 KB)
Thanks a lot!!! Annie