This seems like a very interesting analysis. I have actually never tried using MAGs for custom pangenome generation… yet. It is definitely something I want to try in the future.
From my own experience, running PanPhlAn with the sensitive options
--min_coverage 1 --left_max 1.70 --right_min 0.30 usually filters the samples in a satisfying way. However, I work on large-scale analyses, usually with more than a thousand samples, so I don’t mind discarding many of them.
I would also advise using these parameters:
--th_non_present 0.25 --th_present 0.5 instead of the default ones. They make the presence/absence call for each individual gene more stringent and robust.
With these parameters, profile everything and check whether depth could be a confounding factor. I would advise you to look at
- the average and distribution of the number of gene families present per strain
- the pairwise Jaccard distances within groups versus between groups (intra-group/inter-group comparison)
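In case it helps, here is a minimal Python sketch of those two checks. It assumes you have already turned the PanPhlAn presence/absence matrix into one set of gene families per sample; the sample names, group labels and function names below are made up for illustration and are not part of PanPhlAn.

```python
from itertools import combinations
from statistics import mean


def jaccard_distance(a, b):
    """Return 1 - |A ∩ B| / |A ∪ B| for two sets of gene families."""
    union = a | b
    if not union:
        return 0.0  # two empty profiles are treated as identical
    return 1.0 - len(a & b) / len(union)


def gene_family_counts(profiles):
    """Number of gene families called present in each sample."""
    return {sample: len(genes) for sample, genes in profiles.items()}


def intra_inter_distances(profiles, groups):
    """Split all pairwise Jaccard distances into intra- and inter-group lists."""
    intra, inter = [], []
    for s1, s2 in combinations(sorted(profiles), 2):
        d = jaccard_distance(profiles[s1], profiles[s2])
        (intra if groups[s1] == groups[s2] else inter).append(d)
    return intra, inter


# Toy data (hypothetical): sample -> gene families present, sample -> group.
profiles = {
    "s1": {"g1", "g2", "g3"},
    "s2": {"g1", "g2", "g4"},
    "s3": {"g1", "g5"},
    "s4": {"g1", "g5", "g6"},
}
groups = {"s1": "A", "s2": "A", "s3": "B", "s4": "B"}

counts = gene_family_counts(profiles)
intra, inter = intra_inter_distances(profiles, groups)
print("gene families per sample:", counts)
print("mean intra-group distance:", mean(intra))
print("mean inter-group distance:", mean(inter))
```

To look at depth, you could plot the per-sample gene family counts against each sample's sequencing depth: a clear trend there would suggest depth is driving part of the signal.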
However, I guess that in your case you should also find a way to disentangle that effect from the genuine variability between your groups of interest. How easy that is depends on the openness of your species’ pangenome, the size of its core genome compared to the accessory one…
Feel free to ask if anything is still unclear.