Marker gene identification within the multiple sequence alignment from StrainPhlAn

Hi forum!
This question might only be relevant to a very specific set of StrainPhlAn users, but I was wondering how best to identify the order and positions of the marker genes within the consensus sequences in the multiple sequence alignment. Essentially, I want to identify the marker genes that contain regions of high genetic variability within my MSA, and then look into the functional annotation or potential of those marker genes to gain insight into the biology that differentiates bacterial strains. For example, if I find high genetic variability at 1000-1015 bp of the consensus sequences in my MSA, is there a way to identify the specific marker gene in which this region lies? Furthermore, are there already annotations of each marker gene within the MetaPhlAn database, or in any paper I could reference?
Thank you! Any insight would be much appreciated!

Hi @Jeffrey_Chiu
There is no easy way to identify which positions in the final trimmed MSA correspond to each specific marker, but you can use the --debug option to keep the temporary folder and check the contents of the trim_not_variant folder. There you will find the individual trimmed alignment for each marker. If you want the non-trimmed alignments instead, they are stored in the msas folder.
Annotation is a little trickier, since the names of the markers in the StrainPhlAn temporary folders are converted to random unique numeric identifiers to avoid problems with the alignment tools. However, you could annotate the FASTA file at s__Species_name/s__Species_name.fna (within the temporary folder) using a tool like prokka (https://github.com/tseemann/prokka).
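To rank the per-marker alignments in trim_not_variant by variability, something along these lines might be a starting point. It is only a rough sketch: it assumes each file in that folder is a plain-text FASTA alignment, and the function names (parse_fasta, variable_columns, summarize_folder) are made up for illustration and are not part of StrainPhlAn.

```python
from pathlib import Path


def parse_fasta(text):
    """Parse FASTA-formatted text into {record_name: sequence}."""
    records, name = {}, None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].strip()
            records[name] = []
        elif name is not None:
            records[name].append(line)
    return {k: "".join(v).upper() for k, v in records.items()}


def variable_columns(seqs, ignore="-N"):
    """Return 0-based alignment columns holding more than one distinct base.

    Gap and N characters are ignored when counting distinct bases.
    """
    seqs = list(seqs)
    if not seqs:
        return []
    length = min(len(s) for s in seqs)
    skip = set(ignore)
    return [i for i in range(length) if len({s[i] for s in seqs} - skip) > 1]


def summarize_folder(folder):
    """Yield (file name, alignment length, number of variable columns).

    Assumption: every regular file in `folder` is a FASTA alignment.
    """
    for path in sorted(Path(folder).iterdir()):
        if not path.is_file():
            continue
        aln = parse_fasta(path.read_text())
        if aln:
            length = len(next(iter(aln.values())))
            yield path.name, length, len(variable_columns(aln.values()))


# Example usage (the path is a placeholder for your own temporary folder):
# for name, length, n_var in summarize_folder("path/to/trim_not_variant"):
#     print(name, length, n_var, sep="\t")
```

Sorting the output by the last column would point to the markers with the most variable sites, which could then be matched against the annotation of the .fna file.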
I hope this helps

Hi, I was going to ask quite similar questions regarding individual genes.

Is it possible to also keep the original individual genes for each sample (before alignment with MAFFT and trimming with trimAl)?

I hope to use the original sequences to check the codons of the genes. I see that msas contains the untrimmed alignments, but those are already aligned, so I cannot get the codon information from them. I did not see an option to align while keeping the codon frame, so I thought that if I could get the original individual genes, I could align them my own way.

Could you please give me some insight on this? Thanks!

Hi @fancyge
The unaligned reconstructed marker genes for each sample can be found in the markers_dna folder. I hope this is what you are looking for.
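Once the unaligned sequences are loaded as plain strings, a codon check could look roughly like the sketch below. It makes several assumptions: the sequence is on the coding strand, you know or can guess the reading frame, and the standard genetic code applies. How to read the files in markers_dna is left out on purpose, since the file format is not assumed here, and the helper names (split_codons, translate) are hypothetical.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def split_codons(seq, frame=0):
    """Split an unaligned DNA sequence into complete codons.

    `frame` (0, 1 or 2) is the offset of the first codon; any trailing
    bases that do not fill a codon are dropped. Gap characters are removed.
    """
    seq = seq.upper().replace("-", "")[frame:]
    end = len(seq) - len(seq) % 3
    return [seq[i:i + 3] for i in range(0, end, 3)]


def translate(seq, frame=0):
    """Translate codon by codon; codons with ambiguous bases become 'X'."""
    return "".join(CODON_TABLE.get(c, "X") for c in split_codons(seq, frame))


# Example usage with a made-up sequence:
# print(split_codons("ATGGCCTAA"))
# print(translate("ATGGCCTAA"))
```

A translation with many internal stop codons (`*`) usually suggests the wrong frame or strand, so trying all three frames is a cheap sanity check before doing a codon-aware alignment.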

Thanks so much. This is very helpful!