You mention a ChocoPhlAn-to-UniRef90 map in the 2018 Nature Methods paper, but I cannot find that map. It would really help to have access to it, since I ran HUMAnN2 with ChocoPhlAn but want to compare against UniRef90 genes from the literature and do not want to re-run HUMAnN at this time. Thanks!
That map doesn’t exist as a separate file but is embedded in the ChocoPhlAn headers. If you look at the FASTA files corresponding to the different species pangenomes (i.e. the ChocoPhlAn database), each gene header was decorated by us with its UniRef90/50 annotations (if known) and the gene’s length. The rest of the header can be used to trace the gene back to a source genome.
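As a rough sketch of how such a map could be pulled out of the headers: the Python below assumes each header has five pipe-delimited fields (gene ID, taxonomy, UniRef90, UniRef50, gene length), which matches a header such as the one used as `example`. The function names are hypothetical, and how unannotated genes are marked in the UniRef fields should be checked against the actual files.

```python
def parse_chocophlan_header(header):
    """Split one ChocoPhlAn-style FASTA header into its fields.

    Assumption: five pipe-delimited fields in the order
    gene ID | taxonomy | UniRef90 | UniRef50 | gene length.
    """
    fields = header.lstrip(">").strip().split("|")
    if len(fields) != 5:
        raise ValueError(f"unexpected header layout: {header!r}")
    gene_id, taxonomy, uniref90, uniref50, length = fields
    return {
        "gene_id": gene_id,
        "taxonomy": taxonomy,
        "uniref90": uniref90,
        "uniref50": uniref50,
        "length": int(length),
    }


def gene_to_uniref90(fasta_lines):
    """Build {gene ID: UniRef90 annotation} from an iterable of FASTA lines."""
    mapping = {}
    for line in fasta_lines:
        if line.startswith(">"):  # sequence lines are skipped
            record = parse_chocophlan_header(line)
            mapping[record["gene_id"]] = record["uniref90"]
    return mapping


# Example header in the assumed layout.
example = (
    ">542__Q5NMT2__A254_00003|k__Bacteria.p__Proteobacteria."
    "c__Alphaproteobacteria.o__Sphingomonadales.f__Sphingomonadaceae."
    "g__Zymomonas.s__Zymomonas_mobilis|UniRef90_Q5NMT2|UniRef50_A0A173KXJ4|1101"
)
```

To run this over a real pangenome file, pass an open file handle to `gene_to_uniref90`; if the file is gzip-compressed, `gzip.open(path, "rt")` from the standard library gives a text handle that works the same way.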
That was super helpful! Thanks for the quick reply!
@franzosa sorry, I’m a bit confused: from the ChocoPhlAn database FASTA files (of centroids), how do I trace back to the source strain genomes? I’m trying to compile a list of all the source strain genomes used to build the entire ChocoPhlAn database. For example, what are the source strain genome identifiers used as a basis for the following centroid?
>542__Q5NMT2__A254_00003|k__Bacteria.p__Proteobacteria.c__Alphaproteobacteria.o__Sphingomonadales.f__Sphingomonadaceae.g__Zymomonas.s__Zymomonas_mobilis|UniRef90_Q5NMT2|UniRef50_A0A173KXJ4|1101
The Q5NMT2 identifier there is a UniProt entry. If you look that up…
https://www.uniprot.org/uniprotkb/Q5NMT2
You’ll see it comes from the genome of a particular strain, in this case “Zymomonas mobilis subsp. mobilis (strain ATCC 31821 / ZM4 / CP4).” However, this isn’t a reliable way to get a complete listing of strains since, for example, if two strains were ~identical we’d only have one representative sequence per protein family in the pangenome file (and theoretically they could all be from one of the two identical strains). I will attach an approximate list of the source genomes we used for building the bioBakery 3 databases below.
bb3_genomes.txt (1.3 MB)
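If you want to do that lookup for many centroids, here is a minimal Python sketch. It assumes the gene ID (the first pipe-delimited field of the header) is made of parts joined by double underscores with the UniProt accession in second position, which is inferred only from the single example header above; the function name is hypothetical. The same caveat applies: this finds the strain behind each representative sequence, not a complete list of source strains.

```python
def uniprot_url_from_gene_id(gene_id):
    """Return the UniProtKB page URL for a ChocoPhlAn-style gene ID.

    Assumption: the ID looks like <prefix>__<UniProt accession>__<locus tag>,
    as in 542__Q5NMT2__A254_00003.
    """
    parts = gene_id.split("__")
    if len(parts) < 2 or not parts[1]:
        raise ValueError(f"no accession found in gene ID: {gene_id!r}")
    accession = parts[1]
    return f"https://www.uniprot.org/uniprotkb/{accession}"


url = uniprot_url_from_gene_id("542__Q5NMT2__A254_00003")
```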
Thanks very much!
If at some point you and your team could publish the database creation workflow on GitHub, showing how you got from the NCBI and UniProt online resources to these assembly identifiers and finally to the ChocoPhlAn gene centroid sequences, that would be very valuable to the user community, and I think others would like to see it too.
For me, it is easier to learn and understand the details by looking at code than by reading the written protocol summaries in the eLife and Nat Methods papers.
Or, @franzosa, if I missed it somewhere in the bioBakery GitHub organization codebase, could you at least point me to how you arrived at this set of ~98k strain reference genomes from the NCBI and UniProt online resources?