Marker gene identification within the multiple sequence alignment from StrainPhlAn

Hi forum!
This question might only be relevant to a very specific set of StrainPhlAn users, but I was wondering how best to identify the order and positions of the marker genes within the consensus sequences in the multiple sequence alignment (MSA). Essentially, I want to identify the marker genes that contain regions of high genetic variability within my MSA, and then look into the functional annotation of those marker genes to gain insight into the biology that differentiates bacterial strains. For example, if I find that there is high genetic variability at 1000-1015 bp of a consensus sequence in my MSA, is there a way to identify the specific marker gene in which this region lies? Furthermore, are there already annotations for each marker gene within the MetaPhlAn database, or is there a paper I could reference?
Thank you! Any insight would be much appreciated!

Hi @Jeffrey_Chiu
There is no easy way to identify the positions in the final trimmed MSA that correspond to each specific marker, but you can use the --debug option to keep the temporary folder and check the content of the trim_not_variant folder. There, you will find the individual trimmed alignment for each marker. If you instead want to check the non-trimmed version of the alignments, they are stored in the msas folder.
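Since the per-marker alignments are kept as separate files, one way to find the most variable markers is to score each file on its own rather than mapping coordinates back from the concatenated MSA. Below is a minimal sketch in Python, assuming the files in trim_not_variant are aligned FASTA files (I have not checked the exact file names or extensions, so the glob pattern is a guess to adjust). The function names are just illustrative and are not part of StrainPhlAn.

```python
from collections import Counter
from pathlib import Path


def read_aligned_fasta(path):
    """Return {sequence_id: aligned_sequence} from a FASTA alignment file."""
    records, name, chunks = {}, None, []
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            if line.startswith(">"):
                if name is not None:
                    records[name] = "".join(chunks)
                name, chunks = line[1:].split()[0], []
            else:
                chunks.append(line.upper())
    if name is not None:
        records[name] = "".join(chunks)
    return records


def column_variability(seqs):
    """Per column: fraction of called bases that differ from the most common base.

    Gaps ('-') and 'N' are ignored; a column with no called bases scores 0.0.
    """
    length = len(seqs[0])
    if any(len(s) != length for s in seqs):
        raise ValueError("sequences in an alignment must have equal length")
    scores = []
    for i in range(length):
        counts = Counter(s[i] for s in seqs if s[i] not in "-N")
        total = sum(counts.values())
        scores.append(0.0 if total == 0 else 1.0 - max(counts.values()) / total)
    return scores


def rank_markers(folder, pattern="*"):
    """Rank per-marker alignment files by mean column variability, highest first.

    `pattern` is an assumption: narrow it to the real extension in your folder.
    """
    ranking = []
    for path in sorted(Path(folder).glob(pattern)):
        if not path.is_file():
            continue
        seqs = list(read_aligned_fasta(path).values())
        if len(seqs) < 2 or not seqs[0]:
            continue  # need at least two non-empty sequences to compare
        scores = column_variability(seqs)
        ranking.append((path.name, sum(scores) / len(scores)))
    return sorted(ranking, key=lambda item: item[1], reverse=True)
```

Calling rank_markers on the path to your trim_not_variant folder returns (file name, mean variability) pairs. The per-column scores from column_variability can also be scanned directly if you care about a specific window within one marker rather than the marker as a whole.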
The annotation is a little trickier, since the names of the markers in the StrainPhlAn temporary folders are converted to random unique numeric identifiers to avoid problems with the alignment tools. But you could annotate the FASTA file in s__Species_name/s__Species_name.fna (within the temporary folder) using a tool like prokka (https://github.com/tseemann/prokka).
I hope this helps
