Where do reference genome marker sequences come from?


So this may be highlighting a deeper misunderstanding I have with the StrainPhlAn process, but how are the marker gene sequences from the reference genomes found? I was having a closer look at the alignments because I was surprised at how far away the reference genome (Bifidobacterium_breve) I inputed was from most of my samples. I noticed that I can’t actually find where the marker sequences in the concatenated alignment file for the reference genome are located in the original reference genome I inputted.


Hi @sophieleech
For the reference sequences, StrainPhlAn relies on PhyloPhlAn to retrieve the marker sequences. Internally, PhyloPhlAn will map (by default using BLASTn) the marker gene sequences from the MetaPhlAn database against the reference genome contigs to identify and extract them.
Notice that the concatenated alignment file produced in the output folder has already been trimmed (by removing non variant and gappy positions of the alignment) so it is practically impossible to map back to the reference contigs. However, if you run StrainPhlAn with the parameter --debug, you will keep the temporal folders generated within the execution. Within the temporal folder, you can explore the markers_dna folder and find the marker sequences for each of the reference genomes and samples.