I am relatively new to StrainPhlAn. I tried to run StrainPhlAn across my data for the bacterium Staphylococus epidermidis with a permissive threshold of --sample_with_n_markers and --marker_in_n_samples (both set to 20).
Unfortunately, it looks as if the tree and MSA discarded a sample that, according to MetaPhlAn, had a 21% relative abundance of this species. I applied metaphlan’s clade profile option to observe the number of species-specific markers that were found within this sample and saw that the majority of species markers were indeed found in this sample.
In order to make sure that StrainPhlAn wasn’t discarding the markers present in this sample, I tried to run StrainPhlAn across 4 samples, with the --sample_with_n_markers and --marker_in_n_samples parameters set to 20, suggesting (to the best of my understanding) that a marker present in at least 1 sample should meet the threshold, and not be discarded, yet this still did not solve the issue.
Do you have any thoughts on why StrainPhlAn doesn’t present this sample despite it containing enough markers of the S.epidermidis species? Could it have something to do with the coverage of each marker? I did note that the average coverage of the markers in this specific sample was lower (less than 10) than that found for the same bacterium in other samples (which were identified by StrainPhlAn).
Thanks in advance for your help!
There could be different reasons why StrainPhlAn is not able to profile S. epidermidis in your sample, the most common are:
- The sequencing was too shallow and thus there is not enough depth of coverage for StrainPhlAn to accurately reconstruct 80% of the markers length
- Your sample contains several S. epidemidis strains in a similar abundance and thus StrainPhlAn is not able to accurately reconstruct the markers thus many polymorphisms
With this in mind, the best way to check what is the problem, is to inspect the
pkl file sample2markers.py generates. When open with python (using the pickle library: pickle — Python object serialization — Python 3.10.5 documentation) it contains a list with all the marker sequences reconstructed in your sample (with a breadth of coverage > 80%). Each element of the list contains, the marker name, the breadth of coverage and the sequence of the marker (containing
N in positions without enough depth of coverage and
* in the polymorphic positions, defining polymorphic position as a position in which the main allele dominance is bellow 80%). To retrieve which markers belong to your species, you can use the following file: http://cmprod1.cibio.unitn.it/biobakery3/metaphlan_databases/mpa_v30_CHOCOPhlAn_201901_marker_info.txt.bz2
If you don’t find the S. epidermis markers in your pkl file, it means that one of the two scenarios had happened, and, to know which of them, you could execute sample2markers.py again on the sample but reducing the
-b parameter and then inspecting in the pkl file to check whether the problem is due to the depth of coverage (if you find an enrichment of
N positions) or due to multiple strains at the same abundance (if you find an enrichment of