Inconsistent Format for Gene Start Positions in PanPhlAn Database

Hi,

We are using the database that ships with PanPhlAn to extract gene sequences by combining the annotations in <species>_pangenome.tsv with the contigs in <species>_pangenome_contigs.fna. While doing so, we ran into a problem that I would like to bring to your attention.

According to the GFF file format description (https://www.ensembl.org/info/website/upload/gff.html), start and end positions are given as 1-based integers, and both positions are inclusive.
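
For anyone slicing sequences in Python, here is a minimal sketch of what "1-based, inclusive" means in practice. The helper name gff_to_slice and the toy contig are made up for illustration; only the coordinate convention comes from the GFF description.

```python
def gff_to_slice(start: int, end: int) -> slice:
    """Convert 1-based inclusive GFF coordinates to a 0-based half-open slice."""
    if start < 1 or end < start:
        raise ValueError(f"invalid GFF interval: {start}..{end}")
    # Only the start moves: the inclusive 1-based end equals the exclusive 0-based end.
    return slice(start - 1, end)


# Toy contig (hypothetical). A feature covering bases 3..6 in GFF terms:
contig = "ACGTACGTAC"
print(contig[gff_to_slice(3, 6)])  # GTAC
```

The length of a feature is end - start + 1 in GFF coordinates, which is the same as the length of the resulting slice.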

However, the PanPhlAn database seems to be inconsistent about the start position and mixes 1-based and 0-based starts. Which convention is used seems to depend on the species (at the end, I include a list of species where we needed to shift the start positions by 1 in order to extract the correct sequences).

Example 1: E. coli positions are 0-based
Let me elaborate on this with an example. E. coli lists the start positions as a 0-based integer. From Escherichia_coli_pangenome.tsv we get:

UniRef90_A0A2S5U8H2     cytR_1  GCA_001283905   CYEA01000007.1  130277  131450

The resource at NCBI (GCA_001283905.2_8205_3_35_genomic.gff.gz) has the following entry:

CYEA01000007.1  EMBL    gene    130278  131450  .       -       .       ID=gene-ERS139232_01718;Name=cytR_1;gbkey=Gene;gene=cytR_1;gene_biotype=protein_coding;locus_tag=ERS139232_01718
CYEA01000007.1  EMBL    CDS     130278  131450  .       -       0       ID=cds-CTX09383.1;Parent=gene-ERS139232_01718;Dbxref=NCBI_GP:CTX09383.1;Name=CTX09383.1;gbkey=CDS;gene=cytR_1;inference=similar to AA sequence:RefSeq:YP_002411761.1;locus_tag=ERS139232_01718;product=putative transcriptional regulator;protein_id=CTX09383.1;transl_table=11

The start position in Escherichia_coli_pangenome.tsv (130277) is one less than in the GFF (130278), while the end position (131450) is identical.

Example 2: Acetobacter aceti positions are 1-based
Let’s take a look at another species, Acetobacter aceti.
Again, from the Acetobacter_aceti_pangenome.tsv we look at the first entry:

UniRef90_A0A0D6MW42     Abac_022_017    GCA_000963905   BAMU01000022.1  16270   16692

In the GFF from NCBI we get:

BAMU01000022.1  DDBJ    gene    16270   16692   .       +       .       ID=gene-Abac_022_017;Name=Abac_022_017;gbkey=Gene;gene_biotype=protein_coding;locus_tag=Abac_022_017
BAMU01000022.1  DDBJ    CDS     16270   16692   .       +       0       ID=cds-GAN57884.1;Parent=gene-Abac_022_017;Dbxref=NCBI_GP:GAN57884.1;Name=GAN57884.1;gbkey=CDS;locus_tag=Abac_022_017;product=hypothetical protein;protein_id=GAN57884.1;transl_table=11

Here both positions match those given in the GFF file, so the start position is 1-based.
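
The comparison done by hand in the two examples could be scripted roughly as follows. This is only a sketch under assumptions: it assumes the second TSV column is the GFF Name attribute and the fourth is the contig ID (as in the two rows shown), and the function names are made up. The GFF attribute column is shortened here for readability.

```python
def parse_gff_genes(gff_lines):
    """Map (contig, gene name) -> (start, end) for GFF 'gene' features.

    If a name occurs more than once on a contig, the first entry is kept.
    """
    genes = {}
    for line in gff_lines:
        if line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 9 or cols[2] != "gene":
            continue
        attrs = dict(kv.split("=", 1) for kv in cols[8].split(";") if "=" in kv)
        name = attrs.get("Name")
        if name is not None:
            genes.setdefault((cols[0], name), (int(cols[3]), int(cols[4])))
    return genes


def start_offset(tsv_row, genes):
    """Return GFF start minus TSV start, or None if the gene cannot be matched."""
    _family, name, _genome, contig, start, end = tsv_row.split()
    hit = genes.get((contig, name))
    if hit is None or hit[1] != int(end):
        return None
    return hit[0] - int(start)


ecoli_gff = ["CYEA01000007.1\tEMBL\tgene\t130278\t131450\t.\t-\t.\t"
             "ID=gene-ERS139232_01718;Name=cytR_1"]
ecoli_row = "UniRef90_A0A2S5U8H2\tcytR_1\tGCA_001283905\tCYEA01000007.1\t130277\t131450"

aceti_gff = ["BAMU01000022.1\tDDBJ\tgene\t16270\t16692\t.\t+\t.\t"
             "ID=gene-Abac_022_017;Name=Abac_022_017"]
aceti_row = "UniRef90_A0A0D6MW42\tAbac_022_017\tGCA_000963905\tBAMU01000022.1\t16270\t16692"

print(start_offset(ecoli_row, parse_gff_genes(ecoli_gff)))  # 1 -> TSV start is 0-based
print(start_offset(aceti_row, parse_gff_genes(aceti_gff)))  # 0 -> TSV start is 1-based
```

An offset of 1 corresponds to the E. coli case and an offset of 0 to the Acetobacter aceti case. Running this over a handful of rows per species would show which convention a species follows.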

List of species where positions seem to be 0-based
To help you look into this issue, here is the list of species for which I need to treat the start positions as 0-based integers:

  • Acinetobacter_baumannii
  • Escherichia_coli
  • Klebsiella_pneumoniae
  • Mycobacterium_tuberculosis
  • Salmonella_enterica
  • Vibrio_parahaemolyticus
  • Micrococcus_luteus
  • Bordetella_pertussis
  • Vibrio_cholerae
  • Staphylococcus_aureus
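
For reference, the workaround boils down to something like the sketch below. It is not our exact code: the function names and the toy contig are made up, and it assumes that contig IDs in the .fna headers match the contig column of the TSV. The rows shown above carry no strand column, so the sketch returns the forward-strand sequence only.

```python
# Species from the list above whose TSV start positions appear to be 0-based.
ZERO_BASED_START_SPECIES = {
    "Acinetobacter_baumannii", "Escherichia_coli", "Klebsiella_pneumoniae",
    "Mycobacterium_tuberculosis", "Salmonella_enterica", "Vibrio_parahaemolyticus",
    "Micrococcus_luteus", "Bordetella_pertussis", "Vibrio_cholerae",
    "Staphylococcus_aureus",
}


def read_fasta(lines):
    """Parse FASTA lines into {contig_id: sequence}; the ID is the first header token."""
    contigs, current = {}, None
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            current = line[1:].split()[0]
            contigs[current] = []
        elif line and current is not None:
            contigs[current].append(line)
    return {cid: "".join(parts) for cid, parts in contigs.items()}


def extract_gene(species, contigs, contig_id, start, end):
    """Slice a gene out of its contig using the species-specific start convention."""
    # 0-based start: use as is. 1-based start: shift by one. End is inclusive 1-based.
    offset = 0 if species in ZERO_BASED_START_SPECIES else 1
    return contigs[contig_id][start - offset:end]


# Toy data (hypothetical): the gene ATGCCCTAG occupies bases 3..11 (1-based, inclusive).
contigs = read_fasta([">toy_contig example", "AAATGC", "CCTAGGG"])
print(extract_gene("Escherichia_coli", contigs, "toy_contig", 2, 11))   # ATGCCCTAG
print(extract_gene("Acetobacter_aceti", contigs, "toy_contig", 3, 11))  # ATGCCCTAG
```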

Best,
Zeno Sewald

Hello,

That’s very interesting, thanks for spotting it and for the clear explanation. We might be able to correct and improve the pangenome database. Do you know if there is a way of detecting which species need a correction without having to check the NCBI GFF by hand?

Thanks again anyway for raising this issue.
Have a nice day!
Léonard

Hi @leonard.dubois,

thanks for your quick answer.

Sadly, I have not found a pattern here.

Initially I suspected an issue with how the pipeline handles species for which more than 200 genomes were available, as your paper mentions that these are treated separately.

As I did not have a list of these species, I used the annotation as a proxy and applied my fix to all species where exactly 200 different strains are listed (without knowing whether more than 200 genomes had actually been available).

However, I had to remove most of the species from this list again.
Out of the 32 species where the number of listed strains is 200, only these 6 need to be included:

  • Acinetobacter_baumannii
  • Escherichia_coli
  • Klebsiella_pneumoniae
  • Mycobacterium_tuberculosis
  • Salmonella_enterica
  • Vibrio_parahaemolyticus

If this is the list of species where the number of genomes was 200+, then this would be a lead.

However, I also needed to include these 4 species, for which my assumption failed:

  • Micrococcus_luteus
  • Bordetella_pertussis
  • Vibrio_cholerae
  • Staphylococcus_aureus

The Micrococcus luteus annotation lists only 22 strains, while the three others list 199 strains each.

I believe the given list of species is more or less complete, as the gene extraction works fine for me so far when I apply the fix to these.

However, @leonard.dubois, I have another question: are the annotations based on the gene entry of the GFF file or on the CDS line? I found some examples where my extracted sequence included regulatory elements. Is this intentional?

Please also note the post about Duplicate entries in the contigs, also from our group, which raises another question that came up while extracting the genes.

We are going back into some old code, and it seems to filter for the GFF “gene” entries. The duplication issue seems to occur for genomes that have multiple copies of the same gene name (while gene IDs are unique, gene names can be redundant in the GFF). In such cases, only the first entry was considered.
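
To illustrate the behaviour described above (this is a rough sketch, not the actual PanPhlAn code), keeping the first “gene” entry per name and counting redundant names could look like this. The function names and the toy GFF lines are hypothetical.

```python
from collections import Counter


def gene_entries(gff_lines):
    """Yield (name, contig, start, end) for each GFF 'gene' feature with a Name."""
    for line in gff_lines:
        if line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 9 or cols[2] != "gene":
            continue
        attrs = dict(kv.split("=", 1) for kv in cols[8].split(";") if "=" in kv)
        if "Name" in attrs:
            yield attrs["Name"], cols[0], int(cols[3]), int(cols[4])


def summarise_gene_names(gff_lines):
    """Return (first entry per gene name, {name: count} for names seen more than once)."""
    counts, first = Counter(), {}
    for name, contig, start, end in gene_entries(gff_lines):
        counts[name] += 1
        first.setdefault(name, (contig, start, end))  # later copies are ignored
    duplicated = {name: n for name, n in counts.items() if n > 1}
    return first, duplicated


# Toy GFF (hypothetical): geneA appears twice with unique IDs but the same Name.
toy_gff = [
    "contig_1\ttoy\tgene\t100\t400\t.\t+\t.\tID=gene-0001;Name=geneA",
    "contig_1\ttoy\tgene\t500\t800\t.\t+\t.\tID=gene-0002;Name=geneB",
    "contig_1\ttoy\tgene\t900\t1200\t.\t-\t.\tID=gene-0003;Name=geneA",
]
first, duplicated = summarise_gene_names(toy_gff)
print(first["geneA"])  # ('contig_1', 100, 400)
print(duplicated)      # {'geneA': 2}
```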

We have to double-check a couple of things; maybe we’re good for re-generating the whole database.
I’ll keep you updated.

Sounds good. Thanks for your answers!

I see, so the “duplication” is also related to my misconception that the database is based on the CDS entries. The positions given for both entries are the positions of the gene, then.

Have a nice day!

Hi again,

I had a systematic look at the number of genes affected by that issue. Here is a summary file (74.9 KB) with the number of unique gene names and the number of duplicated genes for each species in the PanPhlAn database. In the vast majority of cases, duplicates represent less than 1% of the genes.

That is in any case something we’ll keep in mind and fix in the next release. I wouldn’t be worried about the presence-absence matrix generated by PanPhlAn, apart from a few species in which this issue is quite frequent. But if your research project and questions are more specific than simple strain gene content profiling from metagenomics, I advise you to have a systematic look at the GFF from NCBI for that genome.

Thanks again for pointing out this issue.

Hi @leonard.dubois, I would like to ping you on this topic and ask what your planned release cycle is for the database. We are currently excluding species with a high number of these duplicate entries (thanks for posting the list btw!) but would like to move to a new release of the database as soon as it becomes available.

Have a nice week!