I’d be very interested to know whether you’ve had experience with samples where multiple closely related species from the same genus are present. I am particularly interested in the streptococci in our oral samples, where Streptococcus mitis is dominant but there are significant contributions from Strep. pseudopneumoniae, parasanguinis, infantis, oralis, pneumoniae and salivarius (in order of decreasing abundance).
I sense the potential for a few levels of problems. Firstly, the NCBI classification of these organisms (which I think you use) can be discordant with a more genetically informed lineage (GTDB) - or at the very least multiple genomes tagged as the same species by NCBI can get dotted around the tree by GTDB. Here I have indicated where genomes from the prebuilt pangenomes are placed by GTDB (r95):
Strep. pyogenes genomes are admixed with Strep. mitis, and Strep. mitis & Strep. oralis are intermixed.
There is also considerable sharing of COGs between some streptococcal species. The Euler diagram shows that this is most notable for Strep. oralis and Strep. mitis.
Given that I know mitis dominates in my data, I wonder if the Strep. oralis pangenomes could come out badly. PanPhlAn could report an S. oralis pangenome profile for samples where S. oralis is really below the reportable limit, simply because the COGs it shares with S. mitis are easily detectable.
Homing in on specific questions:
- Have you considered generating pangenomes based on GTDB hierarchies at different resolutions?
- What might be the best way of mitigating the impact of COGs shared between species that are both present in a sample? I could consider removing the shared COGs from the S. oralis pangenome, as I won’t be able to trust the presence/absence data for them. However, I probably don’t need to worry about S. mitis, as the signal from its genes should dominate. Would this seem sensible, and how easy would it be to edit the S. oralis pangenome in this way?
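
To make the second question concrete, here is a minimal sketch of the kind of filtering I have in mind. It assumes each pangenome is a tab-separated table with the gene-family (COG) identifier in the first column. That layout, the `family_col` parameter and the function names are my own assumptions for illustration, not PanPhlAn's documented format or API.

```python
import csv
import io


def family_ids(tsv_text, family_col=0):
    """Collect the set of gene-family IDs from a tab-separated pangenome table.

    Assumption: the gene-family identifier sits in column `family_col`.
    """
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t")
    return {row[family_col] for row in reader if row}


def drop_shared_families(target_tsv, other_tsv, family_col=0):
    """Remove from `target_tsv` every row whose gene family also occurs in `other_tsv`.

    Returns the filtered table as text plus the set of shared family IDs,
    so the removed families can be inspected before trusting the result.
    """
    shared = family_ids(target_tsv, family_col) & family_ids(other_tsv, family_col)
    out = io.StringIO()
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    for row in csv.reader(io.StringIO(target_tsv), delimiter="\t"):
        # Keep only rows whose family is not shared with the other species.
        if row and row[family_col] not in shared:
            writer.writerow(row)
    return out.getvalue(), shared
```

In practice the two inputs would be the contents of the S. oralis and S. mitis pangenome files, read with `open(path).read()`, and the filtered text would be written out as a new S. oralis pangenome. Whether the downstream profiling step accepts a pangenome edited like this is exactly what I am unsure about.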
Thanks again for such a useful tool!