Hi,
We are using the database that comes with PanPhlAn to extract gene sequences by combining the annotations in <species>_pangenome.tsv with <species>_pangenome_contigs.fna. However, this led to a problem that I would like to bring to your attention.
According to the GFF file format description (https://www.ensembl.org/info/website/upload/gff.html), start and end positions are given as 1-based integers, and both positions are inclusive.
However, the PanPhlAn database seems to be inconsistent in its start position format and mixes 1-based and 0-based start positions. The format seems to depend on the species. (At the end I include a list of species where we needed to shift the start positions by 1 in order to extract the correct sequences.)
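To make the two conventions concrete, here is a minimal sketch of how the start position translates into a Python slice on the contig sequence. This is not PanPhlAn code: the function name `extract`, the `zero_based_start` flag, and the toy contig are made up for illustration, and strand handling (reverse complement) is left out.

```python
def extract(contig_seq: str, start: int, end: int, zero_based_start: bool) -> str:
    """Return the subsequence of a feature whose end position is inclusive.

    GFF-style start (1-based, inclusive): slice [start - 1 : end]
    0-based start with the same end:      slice [start : end]
    """
    offset = 0 if zero_based_start else 1
    return contig_seq[start - offset:end]


contig = "AACCGGTT"  # toy sequence, not real data

# The feature "CCGG" has 1-based start 3 and end 6; its 0-based start is 2.
print(extract(contig, 3, 6, zero_based_start=False))  # CCGG
print(extract(contig, 2, 6, zero_based_start=True))   # CCGG
# Treating a 0-based start as 1-based pulls in one extra leading base.
print(extract(contig, 2, 6, zero_based_start=False))  # ACCGG
```

The last call shows the failure mode: if the convention is guessed wrong, the extracted sequence is off by one base at the start.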
Example 1: E. coli positions are 0-based
Let me elaborate on this with an example. E. coli lists the start positions as 0-based integers. From Escherichia_coli_pangenome.tsv we get:
UniRef90_A0A2S5U8H2 cytR_1 GCA_001283905 CYEA01000007.1 130277 131450
The resource at NCBI (GCA_001283905.2_8205_3_35_genomic.gff.gz) has the following entry:
CYEA01000007.1 EMBL gene 130278 131450 . - . ID=gene-ERS139232_01718;Name=cytR_1;gbkey=Gene;gene=cytR_1;gene_biotype=protein_coding;locus_tag=ERS139232_01718
CYEA01000007.1 EMBL CDS 130278 131450 . - 0 ID=cds-CTX09383.1;Parent=gene-ERS139232_01718;Dbxref=NCBI_GP:CTX09383.1;Name=CTX09383.1;gbkey=CDS;gene=cytR_1;inference=similar to AA sequence:RefSeq:YP_002411761.1;locus_tag=ERS139232_01718;product=putative transcriptional regulator;protein_id=CTX09383.1;transl_table=11
The start position differs by one between the GFF (130278) and Escherichia_coli_pangenome.tsv (130277), while the end position (131450) is identical.
Example 2: Acetobacter aceti positions are 1-based
Now let's take a look at another species, Acetobacter aceti. Again, we look at the first entry, this time from Acetobacter_aceti_pangenome.tsv:
UniRef90_A0A0D6MW42 Abac_022_017 GCA_000963905 BAMU01000022.1 16270 16692
In the GFF from NCBI we get:
BAMU01000022.1 DDBJ gene 16270 16692 . + . ID=gene-Abac_022_017;Name=Abac_022_017;gbkey=Gene;gene_biotype=protein_coding;locus_tag=Abac_022_017
BAMU01000022.1 DDBJ CDS 16270 16692 . + 0 ID=cds-GAN57884.1;Parent=gene-Abac_022_017;Dbxref=NCBI_GP:GAN57884.1;Name=GAN57884.1;gbkey=CDS;locus_tag=Abac_022_017;product=hypothetical protein;protein_id=GAN57884.1;transl_table=11
Here the start position matches the one given in the GFF file, so it should be 1-based.
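A quick sanity check that does not need the NCBI GFF at all is the feature length: a complete CDS is normally a multiple of 3 bases long. The sketch below applies this to the coordinates quoted above. It is only a heuristic (partial CDSs or pseudogenes can break it), and the helper name `inclusive_length` is my own.

```python
def inclusive_length(start: int, end: int) -> int:
    """Length of a feature if start and end are both 1-based and inclusive."""
    return end - start + 1


# Coordinates copied from the two examples above.
examples = {
    "Escherichia_coli cytR_1 (tsv)": (130277, 131450),
    "Escherichia_coli cytR_1 (GFF)": (130278, 131450),
    "Acetobacter_aceti Abac_022_017 (tsv)": (16270, 16692),
}

for label, (start, end) in examples.items():
    length = inclusive_length(start, end)
    verdict = "multiple of 3" if length % 3 == 0 else "not a multiple of 3"
    print(f"{label}: {length} bp, {verdict}")
```

Read as 1-based, the E. coli tsv entry is 1174 bp long, which is not a multiple of 3, while the GFF coordinates give 1173 bp. The Acetobacter aceti tsv entry gives 423 bp, consistent with a 1-based start.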
List of species where positions seem to be 0-based
To help you look into this issue, here is the list of species for which I need to treat the start positions as 0-based integers:
- Acinetobacter_baumannii
- Escherichia_coli
- Klebsiella_pneumoniae
- Mycobacterium_tuberculosis
- Salmonella_enterica
- Vibrio_parahaemolyticus
- Micrococcus_luteus
- Bordetella_pertussis
- Vibrio_cholerae
- Staphylococcus_aureus
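For reference, this is roughly how we work around the issue at the moment: a per-species lookup that shifts the start position to the 1-based GFF convention. The names `ZERO_BASED_SPECIES` and `to_one_based_start` are from our own script, not from PanPhlAn, and the set only reflects the species we have checked so far.

```python
# Species whose <species>_pangenome.tsv start positions appear to be 0-based
# (based on our own checks; the list may be incomplete).
ZERO_BASED_SPECIES = {
    "Acinetobacter_baumannii",
    "Escherichia_coli",
    "Klebsiella_pneumoniae",
    "Mycobacterium_tuberculosis",
    "Salmonella_enterica",
    "Vibrio_parahaemolyticus",
    "Micrococcus_luteus",
    "Bordetella_pertussis",
    "Vibrio_cholerae",
    "Staphylococcus_aureus",
}


def to_one_based_start(species: str, start: int) -> int:
    """Return the start position in the 1-based, inclusive GFF convention."""
    return start + 1 if species in ZERO_BASED_SPECIES else start


print(to_one_based_start("Escherichia_coli", 130277))   # 130278
print(to_one_based_start("Acetobacter_aceti", 16270))   # 16270
```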
Best,
Zeno Sewald