Gene sequence FASTA files for Bowtie2 and DIAMOND indexes

Hello

I am trying to build a customized database for Bowtie2 and DIAMOND.
Could you let me know where I can download the gene sequence FASTA files, along with their annotation files or a file mapping the sequences to annotations?

Thank you.

Can you please expand on this? What are you trying to do and with which input sequences?

Hello Eric

Sorry, I did not make my question clear.
Here is what I am trying to do:
I am trying to build a customized database based on the database used by HUMAnN 4.0.
I understand that HUMAnN uses Bowtie2 (with an index built from DNA sequences) and DIAMOND (with an index built from protein sequences) for mapping. Therefore, I am looking for:

  1. all the genes’ DNA sequences you used for building the Bowtie2 index (if you used gene sequences rather than genome sequences),
  2. all the protein sequences used for building the DIAMOND index,
  3. all the annotation files for the genes and proteins (I assume each gene in your database has paired DNA and protein sequences), such as the taxon they are derived from, gene names, gene symbols, KO, and GO, whichever you have.

I have also asked a question about MetaPhlAn 4 in the forum, but so far I have not received a response. I am wondering if you could help answer it.
I understand that mpa_vOct22_CHOCOPhlAnSGB_202403_VSG.fna.gz is the marker sequences file. My questions are:

  1. Are the marker genes in the mpa_vOct22_CHOCOPhlAnSGB_202403_VSG.fna.gz file protein-coding genes?
  2. If yes, are they intact genes (complete CDS) that I can use to derive the corresponding protein sequences?
  3. Is there an annotation file for them (gene name, symbol, GO, KO), which I guess could be related to the HUMAnN database?

Thank you.

Hello Eric,

I hope you’re doing well.

I just wanted to gently follow up on the questions I had asked earlier. I would greatly appreciate any insights or guidance you could provide. If anything is unclear in my original questions, please don’t hesitate to let me know, and I’d be happy to clarify.

Thank you so much for your time and help!

If you download the full version of the HUMAnN 4 ChocoPhlAn database, it includes a pangenome for each SGB recognized by MetaPhlAn. These pangenomes include the gene sequences (DNA) with annotation information in the headers. To build the protein database, these sequences are translated and clustered following the UniRef50 criteria, and also supplemented with additional UniRef50 sequences from UniProt that do not otherwise arise across the SGBs. Genes and proteins inherit the functional annotations that UniProt provides (e.g. GO, KO, Pfam, etc.) and we do a small amount of new annotation for novel proteins in the SGBs, focusing on ECs. We include the tip taxonomy of each SGB in its file name and gene headers; the rest of the taxonomy comes from MetaPhlAn’s database.
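For anyone who wants to derive a protein FASTA from the pangenome gene sequences themselves, here is a minimal Python sketch of the translation step only. It is not the procedure used to build the HUMAnN protein database (it does no UniRef50 clustering or supplementation), and it rests on several assumptions: each record is a coding sequence in the forward orientation starting in frame, the standard codon table applies, and the file names are hypothetical placeholders. Headers are copied through unchanged, so whatever annotation they carry is preserved.

```python
import gzip
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif line:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def translate(dna):
    """Translate an in-frame coding sequence, stopping at the first stop codon.

    Codons containing ambiguous bases are written as 'X'.
    """
    dna = dna.upper()
    protein = []
    for i in range(0, len(dna) - 2, 3):
        aa = CODON_TABLE.get(dna[i:i + 3], "X")
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


def translate_fasta(in_path, out_path):
    """Write a protein FASTA from a (possibly gzipped) nucleotide FASTA."""
    opener = gzip.open if in_path.endswith(".gz") else open
    with opener(in_path, "rt") as src, open(out_path, "w") as dst:
        for header, seq in read_fasta(src):
            protein = translate(seq)
            if protein:  # skip records that yield no residues
                dst.write(f">{header}\n{protein}\n")


if __name__ == "__main__":
    # Hypothetical file names; substitute the pangenome file you downloaded.
    translate_fasta("pangenome_genes.fna.gz", "pangenome_proteins.faa")
```

The resulting protein FASTA could then be indexed with `diamond makedb`, and the original nucleotide FASTA with `bowtie2-build`. Whether a custom database built this way works directly with HUMAnN depends on its expected header format, which this sketch does not attempt to reproduce.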