How to get "taxonomy.tsv" using a self-built database

Hi there!

WAAFLE is a very useful tool, and I am especially interested in using it for environmental microbiomes. I have replaced the database with a BLAST database (v5), but I am having trouble building a file like the demo's “waafledb_taxonomy.tsv”, because it is not clear to me how to derive it from the BLAST v5 database files.

Can you provide some information about building the “waafledb_taxonomy.tsv” file?

Thanks in advance,
Billitones

WAAFLE’s taxonomy file is just a listing of parent-child taxon pairs, so it’s pretty easy and flexible. There ought to be an NCBI taxonomy associated with the BLAST database, but it would probably need to be cleaned up and reformatted a lot to match WAAFLE expectations (e.g. WAAFLE assumes that you have the same taxonomic ranks from root to species tips for all species in the taxonomy).
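As a rough illustration of that last constraint, here is a minimal Python sketch (not part of WAAFLE) that takes child-parent pairs and reports tips whose chain of rank prefixes differs from the most common one. It assumes taxa carry rank prefixes such as k__ and s__ and that each pair is given child first; the function names and toy taxa are hypothetical.

```python
from collections import Counter


def rank_paths(pairs):
    """Map each tip taxon to its root-to-tip string of rank prefixes, e.g. 'kpcofgs'."""
    parent_of = dict(pairs)                      # child -> parent
    parents = set(parent_of.values())
    tips = [child for child in parent_of if child not in parents]
    paths = {}
    for tip in tips:
        ranks, node, seen = [], tip, set()
        while node is not None and node not in seen:  # 'seen' guards against cycles
            seen.add(node)
            ranks.append(node.split("__", 1)[0])      # 's__X' -> 's'
            node = parent_of.get(node)                # None once we pass the root
        paths[tip] = "".join(reversed(ranks))
    return paths


def inconsistent_tips(pairs):
    """Return tips whose rank path differs from the most common rank path."""
    paths = rank_paths(pairs)
    if not paths:
        return {}
    expected = Counter(paths.values()).most_common(1)[0][0]
    return {tip: path for tip, path in paths.items() if path != expected}


# Toy taxonomy: s__C hangs directly off a class and so skips the genus rank.
pairs = [
    ("p__P", "k__K"), ("c__C", "p__P"), ("g__G", "c__C"),
    ("s__A", "g__G"), ("s__B", "g__G"), ("s__C", "c__C"),
]
print(inconsistent_tips(pairs))  # {'s__C': 'kpcs'}
```

Tips flagged this way would need either a placeholder for the missing rank or removal before the file matches the "same ranks everywhere" expectation.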

Thank you for your prompt reply. For most of the taxonomy entries I followed the pattern below:

Take the lineage

k__Bacteria; p__Gemmatimonadetes; c__Gemmatimonadetes; o__Gemmatimonadales; f__Gemmatimonadaceae; g__Gemmatimonas; s__

Rewrite as

p__Gemmatimonadetes k__Bacteria
c__Gemmatimonadetes p__Gemmatimonadetes
o__Gemmatimonadales c__Gemmatimonadetes
f__Gemmatimonadaceae o__Gemmatimonadales
g__Gemmatimonas f__Gemmatimonadaceae
s__ g__Gemmatimonas
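The rewrite above could be scripted roughly as follows. This is only a sketch: it assumes lineages are semicolon-separated strings like the one shown, that each output line is child first and parent second (as in the example), and that the two columns are tab-separated because the file is a .tsv. The function names are hypothetical, and empty names such as the bare s__ are passed through unchanged.

```python
import sys


def lineage_to_pairs(lineage, sep=";"):
    """Split a lineage string into (child, parent) pairs, from root to tip."""
    taxa = [t.strip() for t in lineage.split(sep) if t.strip()]
    return [(child, parent) for parent, child in zip(taxa, taxa[1:])]


def write_taxonomy(lineages, handle):
    """Write each distinct child-parent pair once, tab-separated."""
    seen = set()
    for lineage in lineages:
        for pair in lineage_to_pairs(lineage):
            if pair not in seen:        # many lineages share their upper ranks
                seen.add(pair)
                handle.write("\t".join(pair) + "\n")


example = ("k__Bacteria; p__Gemmatimonadetes; c__Gemmatimonadetes; "
           "o__Gemmatimonadales; f__Gemmatimonadaceae; g__Gemmatimonas; s__")
write_taxonomy([example], sys.stdout)
```

Deduplicating pairs matters once many lineages are processed, since every species under the same genus would otherwise repeat the same upper-rank lines.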

If this rewrite is correct I have the following questions:

  1. Can I simply delete the entries that lack an s__ classification from the waafledb_taxonomy.tsv file, and leave the corresponding sequences as they are in the .nhr, .nin and .nsq files?
  2. In the existing waafledb_taxonomy.tsv file, I notice that the taxonomy entries are eventually associated with RefSeq assembly accessions such as GCF_000504785. Can the RefSeq database and a BLAST database be used interchangeably?
  3. The database I swapped in is stored directly in the waafledb folder under its unzipped names (e.g. nt.76.nsq). Do I need to rename the files to the waafledb.04.nsq form?

Looking forward to your reply
Billitones

Re: 1) I don’t think WAAFLE will like it if you’re getting hits to taxa that don’t show up in the taxonomy file. An easy way to resolve the above situation is to label the sequence as s__Gemmatimonas_unclassified and add

s__Gemmatimonas_unclassified g__Gemmatimonas

to the taxonomy (or just remove sequences with incomplete taxonomy).
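A minimal sketch of that relabeling in Python, assuming semicolon-separated lineages with rank prefixes such as g__ and s__. The function name is hypothetical, and extending the same `_unclassified` convention to empty ranks above species is an assumption of this sketch, not something WAAFLE prescribes.

```python
def fill_unclassified(lineage, sep=";"):
    """Replace empty rank names with '<last named taxon>_unclassified'."""
    filled = []
    last_named = None
    for taxon in (t.strip() for t in lineage.split(sep)):
        if not taxon:
            continue
        prefix, _, name = taxon.partition("__")   # 's__' -> ('s', '__', '')
        if name:
            last_named = name
        elif last_named is None:
            raise ValueError("lineage starts with an unnamed rank: " + lineage)
        else:
            name = last_named + "_unclassified"
        filled.append(prefix + "__" + name)
    return filled


lineage = ("k__Bacteria; p__Gemmatimonadetes; c__Gemmatimonadetes; "
           "o__Gemmatimonadales; f__Gemmatimonadaceae; g__Gemmatimonas; s__")
print(fill_unclassified(lineage)[-1])  # s__Gemmatimonas_unclassified
```

The same filled-in species label would also have to be used in the headers of the affected sequences, so that hits and taxonomy agree.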

Re: 2) This was separate information we added about the number of genomes supporting each species-level taxon. It should not be required for WAAFLE to work. WAAFLE’s original database (derived from the HUMAnN 2 pangenome database) was built by clustering the genes from RefSeq genomes according to species into pangenomes. It’s not a direct download of RefSeq / an equivalent BLAST database.

Re: 3) The database can have any name. You just need to point WAAFLE at it using the common prefix for all the files, e.g. /path/to/database/nt.

Dear Franzosa,

Thank you very much for your detailed guidance and suggestions. I will follow your advice to label the sequences appropriately.

Thank you again for your help, and I will share the results and feedback with you after the testing.

Best regards,
Billitones

I tested with the previously downloaded nt database but got unsatisfactory BLAST output: the sseqid field is not provided in the results.

The results are shown below:

|k141_11310|Unknown|1652|0|0|0|0|0|0|0|0|0|0|0.000|0|0|7.17e-138|499|N/A|
|k141_11310|Unknown|1652|0|0|0|0|0|0|0|0|0.000|0|0|7.37e-118|433|N/A|
|k141_11310|Unknown|1652|0|0|0|0|0|0|0|0|0.000|0|0|3.55e-91|344|N/A|
|k141_11310|Unknown|1652|0|0|0|0|0|0|0|0|0.000|0|0|0|8.62e-08|67.6|N/A|
|k141_13573|Unknown|1100|0|0|0|0|0|0|0|0|0.000|0|0|0|1.16e-44|189|N/A|
|k141_13573|Unknown|1100|0|0|0|0|0|0|0|0|0.000|0|0|0|1.99e-27|132|N/A|
|k141_13573|Unknown|1100|0|0|0|0|0|0|0|0|0.000|0|0|3.39e-10|75.0|N/A|
|k141_13573|Unknown|1100|0|0|0|0|0|0|0|0.000|0|0|2.04e-07|65.8|N/A|
|k141_13573|Unknown|1100|0|0|0|0|0|0|0|0.000|0|0|0|2.64e-06|62.1|N/A|

Given your earlier statement that WAAFLE’s database was built from the HUMAnN 2 pangenome database, am I right to understand that it is one of the ChocoPhlAn databases?

Based on the answer linked below, I was wondering whether it is possible to generate FASTA files for WAAFLE via Struo2 and then build the corresponding database with BLAST. Or is there a better workflow for building usable database files?
[Different chocophlan databases? - #6 by jolespin](https://forum.biobakery.org/t/different-chocophlan-databases/368/6)

Also, I was using the nt database before only because it comes with a taxonomy file, which made it easy to script the “waafledb_taxonomy.tsv” file. I don’t yet know how I would produce that file if I go through Struo2.

I am looking forward to your reply!

Dear Franzosa,

I have the following questions regarding customized database construction.

  1. Is it acceptable to skip species pangenomes, given that I am working on environmental microbiomes? That is, can I use RefSeq sequence files directly for ORF prediction and gene clustering, skipping the binning steps?
  2. If I use the RefSeq GCF sequence files directly, do I still need to annotate them against UniRef90 separately, given that the protein annotations are already included in the .faa file and the corresponding .gpff file?
  3. Should the FASTA headers of the non-redundant gene set used to build the BLAST database follow this pattern:
    GENE_index|s__species_name_sp_strain_number|UniRef90=UniRef90_index

I am looking forward to your reply!

Best regards,
Billitones

In reply to the earlier message, Struo(2) wasn’t built by us, so it’s hard for me to comment on using it with WAAFLE. WAAFLE’s database formats were designed to be strictly defined but pretty simple in formatting, such that other appropriate databases could be easily adapted to them. It was convenient for us to do that with HUMAnN 2’s pangenome database (which is indeed a version of ChocoPhlAn), but other similar microbial gene databases should also work.

On your BLAST output, something is wrong there. The sseqid field should correspond to the part of the database sequence (as indexed from a FASTA file) after > and before the first whitespace (or the whole header if there is no whitespace, which is how the default WAAFLE database was built). It looks like BLAST is having trouble finding that information in your headers.
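One way to see what BLAST would have to work with is to pull the candidate IDs straight out of the FASTA file that the database was built from. The sketch below follows the rule described above (text after > and before the first whitespace). The function names are hypothetical, and this only inspects headers; it does not reproduce how BLAST itself indexes sequences.

```python
import re


def header_ids(fasta_lines):
    """Yield, for each header line, the text after '>' up to the first whitespace."""
    for line in fasta_lines:
        if line.startswith(">"):
            # \S* matches nothing if whitespace directly follows '>'
            yield re.match(r"\S*", line[1:]).group(0)


def empty_id_headers(fasta_lines):
    """Return 1-based line numbers of headers whose ID would be empty."""
    return [
        number
        for number, line in enumerate(fasta_lines, start=1)
        if line.startswith(">") and not re.match(r"\S*", line[1:]).group(0)
    ]


lines = [
    ">1|s__X|UniRef90=UniRef90_A\n", "ACGT\n",
    ">gene2 some description\n", "ACGT\n",
    "> gene3\n", "ACGT\n",              # space right after '>' leaves no ID
]
print(list(header_ids(lines)))
print(empty_id_headers(lines))
```

With a real file, `open(path)` can be passed in place of the list for `header_ids`, since it only iterates over lines once.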


In reply to the second message:

[1] Your database does not need to be built from pangenomes. Pangenomes are often convenient because they are compressed relative to repeated isolate genomes (less redundancy, faster search). But they are not a hard requirement.

[2] You can annotate to whatever you like / use whatever annotations you have (including no annotations). The default databases use UniRef90s since that’s what HUMAnN uses (so again they were chosen for convenience to us).

[3] I’m not sure what you’re asking on this one. The key idea is that the first piped field is a unique gene index, the second is a taxonomic identifier (of a tip) that will show up in your taxonomy, and any subsequent piped A=B fields are assumed to be functional annotations.
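To make that header layout concrete, here is a small Python sketch that splits a header into its three parts and checks the taxon against a set of taxonomy tips. The function name, the example header and the tip set are all hypothetical; WAAFLE's own parsing may differ in its details.

```python
def parse_header(seqid):
    """Split 'GENE|TAXON|A=B|...' into (gene, taxon, {A: B, ...})."""
    fields = seqid.split("|")
    if len(fields) < 2:
        raise ValueError("expected at least GENE|TAXON, got: " + seqid)
    gene, taxon = fields[0], fields[1]
    annotations = {}
    for field in fields[2:]:
        key, sep, value = field.partition("=")
        if not sep:
            raise ValueError("annotation field is not A=B: " + field)
        annotations[key] = value
    return gene, taxon, annotations


# Hypothetical tips, i.e. taxa that appear as children but never as parents.
tips = {"s__Gemmatimonas_unclassified"}

gene, taxon, annotations = parse_header(
    "GENE_1|s__Gemmatimonas_unclassified|UniRef90=UniRef90_EXAMPLE"
)
print(gene, taxon, annotations)
print(taxon in tips)  # the taxon field should name a tip in the taxonomy
```

Headers without any annotation fields parse to an empty dictionary, which matches the point that annotations are optional.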