Hi @leonard.dubois,
thanks for your quick answer.
Sadly, I have not been able to find the pattern here.
Initially, I suspected an issue with how the pipeline handles species for which more than 200 genomes were available, since your paper mentions that these are handled separately.
As I did not have a list of these species, I used the annotation as a proxy and applied my fix to all species for which exactly 200 different strains are listed (without knowing whether the actual number was 200+).
However, I had to remove most of the species from this list again.
Of the 32 species for which exactly 200 strains are listed, only these 6 need to be included:
- Acinetobacter_baumannii
- Escherichia_coli
- Klebsiella_pneumoniae
- Mycobacterium_tuberculosis
- Salmonella_enterica
- Vibrio_parahaemolyticus
If this is the list of species for which the number of genomes was 200+, then this would be a lead.
However, I also needed to include these 4 species, for which my assumption failed:
- Micrococcus_luteus
- Bordetella_pertussis
- Vibrio_cholerae
- Staphylococcus_aureus
The Micrococcus luteus annotation lists only 22 strains, while the other three list 199 strains each.
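
In case it helps to reproduce this, here is a minimal sketch of the proxy check in Python. It only uses the strain counts reported in this post. The names `strain_counts`, `CAP`, and `species_at_cap` are made up for illustration and are not part of the pipeline, and the real counts would have to be read from the annotation, whose format I am not assuming here. The 26 other species with 200 listed strains are left out because I have not listed them, so the sketch only shows the species that the proxy misses.

```python
# Strain counts per species as reported in this post.
# All ten species below turned out to need the fix.
strain_counts = {
    "Acinetobacter_baumannii": 200,
    "Escherichia_coli": 200,
    "Klebsiella_pneumoniae": 200,
    "Mycobacterium_tuberculosis": 200,
    "Salmonella_enterica": 200,
    "Vibrio_parahaemolyticus": 200,
    "Micrococcus_luteus": 22,
    "Bordetella_pertussis": 199,
    "Vibrio_cholerae": 199,
    "Staphylococcus_aureus": 199,
}

# Assumption: a listed count of exactly 200 marks a "200+ genomes" species.
CAP = 200


def species_at_cap(counts, cap=CAP):
    """Return the species whose listed strain count equals the cap."""
    return {species for species, n in counts.items() if n == cap}


needs_fix = set(strain_counts)            # species that actually need the fix
proxy_hits = species_at_cap(strain_counts)  # species the proxy would select
missed = needs_fix - proxy_hits           # species the proxy fails to select

for species in sorted(missed):
    print(species, strain_counts[species])
```

The proxy selects the six species with 200 listed strains and misses the other four, so the listed strain count alone cannot be the deciding criterion.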