Where do reference genome marker sequences come from?

sophieleech · June 8, 2022, 12:52am

Hi,

So this may be highlighting a deeper misunderstanding I have with the StrainPhlAn process, but how are the marker gene sequences from the reference genomes found? I was having a closer look at the alignments because I was surprised at how far away the reference genome (Bifidobacterium_breve) I inputed was from most of my samples. I noticed that I can’t actually find where the marker sequences in the concatenated alignment file for the reference genome are located in the original reference genome I inputted.

Thanks!

aitor.blancomiguez · June 8, 2022, 7:48am

Hi @sophieleech
For the reference sequences, StrainPhlAn relies on PhyloPhlAn to retrieve the marker sequences. Internally, PhyloPhlAn will map (by default using BLASTn) the marker gene sequences from the MetaPhlAn database against the reference genome contigs to identify and extract them.
Notice that the concatenated alignment file produced in the output folder has already been trimmed (by removing non variant and gappy positions of the alignment) so it is practically impossible to map back to the reference contigs. However, if you run StrainPhlAn with the parameter --debug, you will keep the temporal folders generated within the execution. Within the temporal folder, you can explore the markers_dna folder and find the marker sequences for each of the reference genomes and samples.

sophieleech · June 28, 2022, 7:10am

Hi,
Thanks so much for the reply. So are the regions shown in the alignment only the variable regions? Does that affect the calculation of the distance measures in the tree?

aitor.blancomiguez · June 28, 2022, 7:47am

Hi @sophieleech
Yes, only the variable (in more than 1% of the samples) and non gappy (less than 33% of gaps) positions are kept. As evaluated in the PhyloPhlAn 3 paper (Precise phylogenetic analysis of microbial isolates and genomes from metagenomes using PhyloPhlAn 3.0 | Nature Communications) this will not affect the distance measures.

Topic		Replies	Views
Marker gene identification within the multiple sequence alignment from StrainPhlAn StrainPhlAn	4	546	July 22, 2022
Length of concatenated alignment file StrainPhlAn	12	580	August 31, 2021
No alignments found to concatenate StrainPhlAn	5	507	June 30, 2022
Error related to step 3; reference genome StrainPhlAn	5	1195	November 23, 2020
Which folder should I put phy_database in? PhyloPhlAn	1	709	May 12, 2020

Where do reference genome marker sequences come from?

Related topics