I have run my samples through StrainPhlAn3 and calculated distance matrices using EMBOSS (with Kimura correction, as per the tutorial).
My question is whether there is a consensus on what to use as an identity cutoff when looking to see if the same strain is present in two different samples? Should I look for a distance of zero or should this be relaxed to allow for sequencing errors?
Thanks in advance,
If you are interested in detecting the same strain in different samples, I would suggest using the normalized phylogenetic distances obtained from the branch lengths of the phylogenetic tree (that is, dividing each pairwise branch distance by the total branch length of the tree). The cutoff will differ slightly depending on the species you are interested in, but 0.01 would be a good approximation (https://www.nature.com/articles/s41467-020-18127-y#Sec4)
Thanks again for the reply. As a follow up question:
By total branch length, do you mean the sum of all edges in the tree or the sum of the branch lengths from all tips to the root?
I have also seen another normalisation method suggested based on dividing the phylogenetic distance between each pair of leaves by the median of these distances.
All the best,
I meant the sum of all the branch lengths in the tree. If you are using Python, you could use the total_branch_length() method that Bio.Phylo tree objects provide in Biopython: https://biopython.org/docs/1.75/api/Bio.Phylo.BaseTree.html#Bio.Phylo.BaseTree.TreeMixin.total_branch_length
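As a rough illustration of that normalisation, here is a minimal sketch using Biopython. The toy Newick tree, the sample names and the helper name normalized_pairwise_distances are made up for the example; in practice you would read the tree file produced by StrainPhlAn instead. The 0.01 cutoff is the approximate value mentioned above and may need adjusting per species.

```python
from io import StringIO
from itertools import combinations

from Bio import Phylo


def normalized_pairwise_distances(tree):
    """Return {(tip_a, tip_b): distance / total branch length of the tree}."""
    total = tree.total_branch_length()  # sum of all edges in the tree
    tips = tree.get_terminals()
    return {
        (a.name, b.name): tree.distance(a, b) / total
        for a, b in combinations(tips, 2)
    }


# Hypothetical toy tree; replace with Phylo.read("<your tree file>", "newick").
newick = "((S1:0.001,S2:0.001):0.1,(S3:0.05,S4:0.05):0.1);"
tree = Phylo.read(StringIO(newick), "newick")

distances = normalized_pairwise_distances(tree)

# Approximate, species-dependent cutoff taken from the reply above.
CUTOFF = 0.01
same_strain_pairs = {pair for pair, d in distances.items() if d < CUTOFF}
```

In this toy tree only the S1/S2 pair falls under the cutoff, so those two samples would be treated as carrying the same strain.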
The other normalisation method you mention will also work (https://www.sciencedirect.com/science/article/pii/S1931312818303172)
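For comparison, the median-based normalisation from the question could look roughly like the sketch below. It assumes you already have the pairwise distances in a dictionary; the values, sample names and the helper name median_normalize are invented for illustration. Note that the 0.01 cutoff above was suggested for the total-branch-length normalisation, so it should not be assumed to carry over to median-normalised values.

```python
from statistics import median


def median_normalize(distances):
    """Divide every pairwise distance by the median of all pairwise distances."""
    med = median(distances.values())
    if med == 0:
        raise ValueError("median distance is zero; cannot normalise")
    return {pair: d / med for pair, d in distances.items()}


# Hypothetical raw pairwise phylogenetic distances between three samples.
raw = {
    ("S1", "S2"): 0.004,
    ("S1", "S3"): 0.2,
    ("S2", "S3"): 0.4,
}

normalized = median_normalize(raw)
```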