Chocophlan to UniRef90 map

lalas25 · January 29, 2021, 2:26pm

You mention a chocophlan to uniref90 map in the 2018 Nature Methods paper, but I cannot find that map. It would really help to have access to that since I ran HUMAnN2 with chocophlan but I want to compare genes from uniref90 from the literature and do not want to re-run HUMAnN at this time. Thanks!

franzosa · January 29, 2021, 2:52pm

That map doesn’t exist as a separate file but is embedded in the ChocoPhlAn headers. If you look at the FASTA files corresponding to the different species pangenomes (i.e. the ChocoPhlAn database), each gene header was decorated by us with its UniRef90/50 annotations (if known) and the gene’s length. The rest of the header can be used to trace the gene back to a source genome.
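For anyone who wants the map as an actual table, here is a minimal Python sketch of pulling it out of the headers. It assumes every header follows the pipe-delimited layout of the example header quoted later in this thread (gene ID, taxonomy, UniRef90, UniRef50, length). The function names `parse_chocophlan_header` and `build_uniref90_map` are made up for illustration and are not part of HUMAnN.

```python
def parse_chocophlan_header(header):
    """Split one ChocoPhlAn FASTA header into its pipe-delimited fields.

    Assumed layout (taken from a single example header, not a spec):
    >gene_id|taxonomy|UniRef90_xxx|UniRef50_xxx|length
    """
    fields = header.strip().lstrip(">").split("|")
    annotations = fields[2:]
    # Look up annotations by prefix rather than position, so a header
    # with a missing field yields None instead of a shifted value.
    uniref90 = next((f for f in annotations if f.startswith("UniRef90_")), None)
    uniref50 = next((f for f in annotations if f.startswith("UniRef50_")), None)
    length = int(fields[-1]) if fields[-1].isdigit() else None
    return {
        "gene_id": fields[0],
        "taxonomy": fields[1] if len(fields) > 1 else None,
        "uniref90": uniref90,
        "uniref50": uniref50,
        "length": length,
    }


def build_uniref90_map(lines):
    """Return {gene_id: UniRef90 annotation} for every header line seen."""
    mapping = {}
    for line in lines:
        if line.startswith(">"):
            record = parse_chocophlan_header(line)
            mapping[record["gene_id"]] = record["uniref90"]
    return mapping
```

`build_uniref90_map` accepts any iterable of lines, so an open file handle works; if your copy of a pangenome file is gzip-compressed, `gzip.open(path, "rt")` gives a suitable handle. Repeating this over all pangenome files would give an approximate gene-to-UniRef90 table.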

lalas25 · January 29, 2021, 4:01pm

That was super helpful! Thanks for the quick reply!

hermidalc · June 28, 2022, 7:13pm

@franzosa sorry, I’m a bit confused: from the ChocoPhlAn database FASTA files (of centroids), how do I trace back to the source strain genomes? I’m trying to compile a list of all the source strain genomes used to build the entire ChocoPhlAn database. For example, what are the source strain genome identifiers used as the basis for the following centroid?

>542__Q5NMT2__A254_00003|k__Bacteria.p__Proteobacteria.c__Alphaproteobacteria.o__Sphingomonadales.f__Sphingomonadaceae.g__Zymomonas.s__Zymomonas_mobilis|UniRef90_Q5NMT2|UniRef50_A0A173KXJ4|1101

franzosa · June 28, 2022, 7:41pm

The Q5NMT2 identifier there is a UniProt entry. If you look that up…

https://www.uniprot.org/uniprotkb/Q5NMT2

You’ll see it comes from the genome of a particular strain, in this case “Zymomonas mobilis subsp. mobilis (strain ATCC 31821 / ZM4 / CP4).” However, this isn’t a reliable way to get a complete listing of strains since, for example, if two strains were ~identical we’d only have one representative sequence per protein family in the pangenome file (and theoretically they could all be from one of the two identical strains). I will attach an approximate list of the source genomes we used for building the bioBakery 3 databases below.

bb3_genomes.txt (1.3 MB)
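The per-gene lookup described above can be scripted, with the same caveat that it will not yield a complete strain list. The Python sketch below assumes the gene ID is the first pipe-delimited field of the header and that the UniProt accession is its second `__`-separated token, as in `542__Q5NMT2__A254_00003`. That layout is inferred from the single example in this thread, and the function names are hypothetical.

```python
UNIPROT_URL = "https://www.uniprot.org/uniprotkb/{}"


def uniprot_accession(header):
    """Return the UniProt accession embedded in a ChocoPhlAn gene ID, or None.

    Assumes gene IDs look like prefix__ACCESSION__locus, which is inferred
    from one example header and may not hold for every entry.
    """
    gene_id = header.strip().lstrip(">").split("|", 1)[0]
    parts = gene_id.split("__")
    return parts[1] if len(parts) >= 3 else None


def uniprot_links(headers):
    """Map each distinct accession to its UniProt entry URL."""
    links = {}
    for header in headers:
        accession = uniprot_accession(header)
        if accession and accession not in links:
            links[accession] = UNIPROT_URL.format(accession)
    return links
```

Each resulting URL opens the UniProt entry, where the source organism and strain are listed. Because near-identical strains share one representative sequence per protein family, the strains recovered this way are only a subset of the source genomes.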

hermidalc · June 28, 2022, 8:42pm

Thanks very much!

If at some point you and your team could publish the database creation workflow on GitHub, showing how you got from the starting NCBI and UniProt online resources to these assembly identifiers and then on to the ChocoPhlAn gene centroid sequences, that would be very valuable to the community of users, and I think others would like to see it too.

Personally, I find it best to learn and understand the details by looking at code rather than by reading the written protocol summaries in the eLife and Nat Methods papers.

Or @franzosa, in case I missed it somewhere in the bioBakery GitHub organization codebase, could you at least share how you arrived at this set of ~98k strain reference genomes from the NCBI and UniProt online resources?
