Marker gene identification within the multiple sequence alignment from StrainPhlAn

Jeffrey_Chiu · January 28, 2022, 8:26pm

Hi forum!
This question might only be relevant to a very specific set of StrainPhlAn users, but I was wondering how best to identify the order and positions of the marker genes within the consensus sequences in the multiple sequence alignment. Essentially, I want to identify the marker genes containing regions of high genetic variability within my MSA and look into the functional annotation or potential of those marker genes to gain insight into the biology differentiating bacterial strains. For example, if I find that there is high genetic variability at 1000-1015 bp of my consensus sequence in my MSA, is there a way to identify the specific marker gene in which this region of high variability lies? Furthermore, are there already annotations of each marker gene within the MetaPhlAn database, or any paper I could reference?
Thank you! Any insight would be much appreciated!

aitor.blancomiguez · February 3, 2022, 10:14am

Hi @Jeffrey_Chiu
There is no easy way to identify the positions in the final trimmed MSA that correspond to each specific marker, but you can use the --debug option to keep the temporary folder and check the contents of the trim_not_variant folder. There, you will find the individual trimmed alignment for each marker. If you want to check the non-trimmed version of the alignments instead, they are stored in the msas folder.
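As a rough sketch of how the per-marker alignments could be screened for variability, the Python below ranks each alignment file by its fraction of variable columns. It assumes the files in trim_not_variant are FASTA-formatted alignments, which is not confirmed here; the function names and the `pattern` argument are illustrative and not part of StrainPhlAn.

```python
from pathlib import Path


def read_fasta(path):
    """Parse a FASTA file into {header: sequence}."""
    records, name = {}, None
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            if line.startswith(">"):
                name = line[1:].split()[0]
                records[name] = []
            elif name is not None:
                records[name].append(line)
    return {key: "".join(parts) for key, parts in records.items()}


def variable_columns(seqs, ignore="-N"):
    """Return 0-based column indices that hold more than one distinct base.

    Gap and N characters are ignored, so a column is only counted when
    at least two different called bases are present.
    """
    seqs = [s.upper() for s in seqs]
    if not seqs:
        return []
    length = len(seqs[0])
    if any(len(s) != length for s in seqs):
        raise ValueError("sequences are not aligned to the same length")
    skipped = set(ignore)
    return [
        i for i in range(length)
        if len({s[i] for s in seqs} - skipped) > 1
    ]


def rank_markers(folder, pattern="*"):
    """Rank per-marker alignment files by fraction of variable columns.

    `pattern` is a glob for the alignment files; the actual extension
    used in the temporary folder is an assumption to check locally.
    """
    rows = []
    for path in sorted(Path(folder).glob(pattern)):
        if not path.is_file():
            continue
        seqs = list(read_fasta(path).values())
        if not seqs or not seqs[0]:
            continue  # not a FASTA file, or an empty alignment
        var = variable_columns(seqs)
        length = len(seqs[0])
        rows.append((path.name, len(var), length, len(var) / length))
    return sorted(rows, key=lambda row: row[3], reverse=True)
```

Calling `rank_markers` on the path to trim_not_variant would return tuples of (file name, variable columns, alignment length, variable fraction), most variable marker first. The file names would be the numeric identifiers used inside the temporary folder, not the original marker names.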
The annotation is a little trickier, since the names of the markers in the StrainPhlAn temporary folders are converted to random unique numeric identifiers to avoid problems with the alignment tools. But you could annotate the FASTA file at s__Species_name/s__Species_name.fna (within the temporary folder) using tools like prokka (https://github.com/tseemann/prokka).
I hope this helps.

fancyge · July 12, 2022, 2:04pm

Hi, I was going to ask quite similar questions regarding individual genes.

Is it possible to also keep the original individual genes for each sample (before aligning with mafft and trimming with trimal)?

I hope to use the original sequences to check the codons of the genes. I see that msas contains untrimmed alignments, but they are already aligned, so I’m not able to get the codon information. I didn’t see an option to align while keeping the codon frame, so I thought that if I can get the original individual genes, I can align them my way.

Could you please give me some insight on this? Thanks!

aitor.blancomiguez · July 18, 2022, 9:50am

Hi @fancyge
The unaligned reconstructed marker genes for each sample can be found in the markers_dna folder. I hope this is what you are looking for.
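One possible way to keep the codon frame once the unaligned sequences are in hand is to translate them, align the proteins with an external aligner, and then thread the original codons back onto the protein alignment. The sketch below covers only the translation and threading steps in plain Python. It assumes each reconstructed marker is on the forward strand and starts in frame, which is not confirmed in this thread, and `translate` and `thread_codons` are hypothetical helper names rather than StrainPhlAn functions.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO)
}


def translate(dna):
    """Translate DNA in frame 0; codons with unknown bases become 'X'.

    Trailing bases that do not fill a whole codon are dropped.
    """
    dna = dna.upper()
    usable = len(dna) - len(dna) % 3
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, usable, 3)
    )


def thread_codons(aligned_protein, dna):
    """Map an aligned protein back onto its codons.

    Each gap in the protein alignment becomes a three-base gap, and each
    residue is replaced by the next codon from the unaligned DNA.
    """
    usable = len(dna) - len(dna) % 3
    out, pos = [], 0
    for residue in aligned_protein:
        if residue == "-":
            out.append("---")
        else:
            out.append(dna[pos:pos + 3])
            pos += 3
    if pos != usable:
        raise ValueError("protein alignment does not match the DNA length")
    return "".join(out)
```

For example, `translate("ATGAAA")` gives `"MK"`, and if the protein aligner returns `"M-K"` for that sequence, `thread_codons("M-K", "ATGAAA")` gives `"ATG---AAA"`. Positions that were not covered in a sample may appear as N or gap characters in the reconstructed sequences, so it is worth checking how many `X` residues a translation contains before trusting the frame.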

fancyge · July 22, 2022, 8:57am

Thanks so much. This is very helpful!
