/reposting from StrainPhlAn (posted there in error)
I’d be very interested to know whether you’ve had experience with samples where multiple inter-related species from the same genus are present. I am very interested in the streptococci in our oral samples, and we have Streptococcus mitis as dominant, but significant contributions from Strep. pseudopneumoniae , parasanguinis , infantis , oralis , pneumoniae and salivarius (in order of decreasing abundance).
I sense the potential for a few levels of problems. Firstly, the NCBI classification of these organisms (which I think you use) can be discordant with a more genetically informed lineage (GTDB) - or at the very least multiple genomes tagged as the same species by NCBI can get dotted around the tree by GTDB. Here I have indicated where genomes from the prebuilt pangenomes are placed by GTDB (r95):
Strep. pneumoniae genomes are admixed with Strep. mitis , and Strep. mitis & Strep. oralis are intermixed. I wonder if the four outliers could be phenotypically false positive Strep. pneumoniae (if that’s a thing), or atypical pneumococci (as per Genetic relationships between clinical isolates of Streptococcus pneumoniae, Streptococcus oralis, and Streptococcus mitis: characterization of “Atypical” pneumococci and organisms allied to S. mitis harboring S. pneumoniae virulence factor-encoding genes - PubMed)
There is also considerable sharing of COGs between some Streptococcal species. The Euler diagram shows that this is most notable for Strep. oralis and Strep. mitis .
Given that I know mitis dominates in my data, I wonder if the Strep. oralis pangenomes could come out badly. Panphlan could report pangenomes for samples where S. oralis is really below the reportable limit, but those COGs shared with S. mitis are easily detectable.
Homing in on specific questions:
- Have you considered generating pangenomes based on GTDB hierarchies at different resolutions?
- What might be the best way of mitigating against impacts of shared COGs between species which are both present in samples. I could consider removing the shared COGs from the S. oralis pangenome, as I won’t be able to trust the presence absence data. However I probably don’t need to worry about S. mitis as signal from its genes should dominate. Would this seem sensible, and how easy would it be to edit the S. oralis pangenome in this way?
Thanks again for such a useful tool!
To add with my thoughts so far. I can rank Streptococcal species by their median relative abundance across my samples (caveat: GTDB reference) and then for every COG which occurs in multiple Streptococcal pangenomes, exclude it from all those except the most abundance species. This would make sense since the most abundant species should have the dominant signal.
Potential problems: if a COG is included in error in pangenome A (e.g. 1 spurious genome) and is present in most members of pangenome B, but A is more abundant, I will preserve the COG in pangenome A and drop it from B. Also, the ranked abundance will not be consistent across all samples, so just because species A has a higher median abundance, doesn’t mean it is higher abundance across all samples.
I tried prioritising species by three approaches: median abundance, COG prevalence among pangenomes, and multiple of the two. Interestingly, it made very little difference to which species-COG combinations were flagged for exclusion (~3% flags changed species if pangenome prevalence used). I’ve thus stuck with the crude median abundance ranking.
We end up with Strep. mitis preserving all its pangenome, whereas Strep. pseudopneumoniae (with a small pangenome to start with) loses 65% of COGs. Others lose less.
Rather than editing the pangenomes and rerunning PanPhlAn, I am exporting lists of the internal gene names to filter out of the map results, thus I can use the same output.
Without this filtering, PanPhlAn reports pangenome results for 190 samples for Strep. mitis, 187 for S. pseudopneumoniae, and 89 for Strep. sanguinis. With the exclusions, it’s no longer possible to get any sample results for Strep. pseudopneumoniae (despite the fact that median relative abundance is 7.8%) and Strep. sanguinis results are available for more samples (113 now) despite median relative abundance only 0.5%!
Anyhow, just some early experimentation!
Looks like a very nice analysis, however it is indeed not the easiest one.
First I’m wondering why defining pangenome as sets of COG ? A significant part of the genes might have no annotation available. On top of that I expect the level of specificity to be lower than UniRef90 IDs, thus leading to more overlaps, but I might be wrong…?
For me there is no clear solution for overlapping pangenome. I would advise to export the PanPhlAn results as presence/absence matrix and as coverage matrix as well. (
--o_covmat in PanPhlan profiling).
In each pangenome, the coverage curve (see example in the PanPhlAn wiki) is expected to be organized in three parts, a plateau with the actual present genes, a left part before the plateau with higher coverage representing multicopy genes and the rightmost part, the tail of the curve, with non present genes.
A solution could be to use some MetaPhlAn coverage results to get the coverage of species markers (core specific genes). Then define custom PanPhlAn thresholds based on this expected coverage for each species.
Then it might be possible to disentangle and see if the coverage of the shared genes is close to the coverage of one of the species markers, or to the sum of it. Of course, the coverage is not always uniform across all the genome, but it might be worth a try.
Generating pangenomes based on GTDB hierarchies could be a very valuable approach. However I think it will only fix the genomes with the wrong species label. Overlaps will still be a thing.
Thanks @leonard.dubois for your valuable suggestions. Marrying the analysis with MetaPhlAn results does sound very sensible. I will try this out if I find time!