So this may be highlighting a deeper misunderstanding I have with the StrainPhlAn process, but how are the marker gene sequences from the reference genomes found? I was having a closer look at the alignments because I was surprised at how far away the reference genome (Bifidobacterium_breve) I inputed was from most of my samples. I noticed that I can’t actually find where the marker sequences in the concatenated alignment file for the reference genome are located in the original reference genome I inputted.
For the reference sequences, StrainPhlAn relies on PhyloPhlAn to retrieve the marker sequences. Internally, PhyloPhlAn will map (by default using BLASTn) the marker gene sequences from the MetaPhlAn database against the reference genome contigs to identify and extract them.
Notice that the concatenated alignment file produced in the output folder has already been trimmed (by removing non variant and gappy positions of the alignment) so it is practically impossible to map back to the reference contigs. However, if you run StrainPhlAn with the parameter
--debug, you will keep the temporal folders generated within the execution. Within the temporal folder, you can explore the
markers_dna folder and find the marker sequences for each of the reference genomes and samples.
Thanks so much for the reply. So are the regions shown in the alignment only the variable regions? Does that affect the calculation of the distance measures in the tree?
Yes, only the variable (in more than 1% of the samples) and non gappy (less than 33% of gaps) positions are kept. As evaluated in the PhyloPhlAn 3 paper (Precise phylogenetic analysis of microbial isolates and genomes from metagenomes using PhyloPhlAn 3.0 | Nature Communications) this will not affect the distance measures.