I have created clusters comparing strains identified in my samples to reference samples based on the profile results, and now I want to characterize the differences between these strains through functional analysis.
To start, I was planning on looking at IDs specific to each cluster, but I am having trouble determining how to do an enrichment analysis, since I don’t know of an efficient way to convert UniRef90/50 names to any other meaningful gene name. I was also considering using KO IDs, but there are only around 400 KO IDs compared to over 10,000 UniRef90 IDs in the pangenome I’m working with. Do you have any recommendations on how I could carry out a functional analysis?
I don’t know which species you are working on, but the downloaded pangenome folder should contain a file called panphlan_[species_name]_annot.tsv, which maps the UniRef90 IDs to several other identifier types. If there are too many KO IDs, I would suggest the COG IDs, either as given in the PanPhlAn-provided file, or mapped further to COG functional categories (one-letter codes; there are only about 20 of them, which may even be too few).
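As a rough illustration of that mapping step, here is a minimal Python sketch that reads a tab-separated annotation table and counts how often each target ID appears among a set of gene families. The column names `UniRef90` and `COG`, the comma separator inside a cell, and the toy IDs are all assumptions made for the example; check the header of your own annotation file and adjust them.

```python
import csv
import io
from collections import Counter, defaultdict


def load_id_mapping(handle, source_col, target_col, sep=","):
    """Map each ID in source_col to the set of IDs found in target_col.

    Rows with an empty source or target cell are skipped.
    """
    mapping = defaultdict(set)
    for row in csv.DictReader(handle, delimiter="\t"):
        source_id = (row.get(source_col) or "").strip()
        targets = (row.get(target_col) or "").strip()
        if not source_id or not targets:
            continue
        for target in targets.split(sep):
            target = target.strip()
            if target:
                mapping[source_id].add(target)
    return dict(mapping)


def count_annotations(family_ids, mapping):
    """Count how many of the given gene families carry each target ID."""
    counts = Counter()
    for family_id in family_ids:
        for target in mapping.get(family_id, ()):
            counts[target] += 1
    return counts


# Toy table standing in for the annotation file (hypothetical columns and IDs).
toy_table = (
    "UniRef90\tCOG\n"
    "UniRef90_A\tCOG0001\n"
    "UniRef90_B\tCOG0001,COG0002\n"
    "UniRef90_C\t\n"
)
uniref_to_cog = load_id_mapping(io.StringIO(toy_table), "UniRef90", "COG")

# Gene families specific to one cluster (hypothetical selection).
cluster_families = ["UniRef90_A", "UniRef90_B", "UniRef90_C"]
cog_counts = count_annotations(cluster_families, uniref_to_cog)
```

With a real file, you would pass `open(path, newline="")` instead of the `io.StringIO` object. Comparing the per-cluster counts against the counts for the whole pangenome gives the inputs for whichever enrichment test you prefer.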
Otherwise, I would advise picking a more specific question and using a tool designed for it, for example dbCAN2 for carbohydrate-active enzymes (CAZy), abricate for antibiotic resistance genes…
Another simple solution could be to stick to the KO IDs but filter them down using a prevalence threshold across the pangenome.
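A prevalence filter of that kind could look roughly like the sketch below. It assumes you have already built a dictionary from each KO ID to the set of genomes or strains that carry it; the KO IDs, strain names, and the 0.3 / 0.9 cut-offs are purely illustrative, not recommended values.

```python
def filter_by_prevalence(presence, n_genomes, min_prev=0.1, max_prev=0.9):
    """Keep features whose prevalence lies within [min_prev, max_prev].

    presence  : dict mapping a feature ID to the set of genomes carrying it
    n_genomes : total number of genomes considered
    Returns a dict mapping each kept feature ID to its prevalence.
    """
    if n_genomes <= 0:
        raise ValueError("n_genomes must be positive")
    kept = {}
    for feature_id, genomes in presence.items():
        prevalence = len(genomes) / n_genomes
        if min_prev <= prevalence <= max_prev:
            kept[feature_id] = prevalence
    return kept


# Toy presence data for four strains (hypothetical IDs and strain names).
ko_presence = {
    "K00001": {"s1", "s2", "s3", "s4"},  # core: present everywhere
    "K00002": {"s1", "s2"},              # variable
    "K00003": {"s4"},                    # rare
}
informative_kos = filter_by_prevalence(
    ko_presence, n_genomes=4, min_prev=0.3, max_prev=0.9
)
```

Dropping both near-universal and very rare KO IDs leaves the ones that can actually differ between clusters, which is usually what matters for a comparison.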
Hope that helps you sort this out; feel free to ask if you have more questions.
Have a nice day