I’ve been reading through the bioprotocol and looking through the chocophlan database to better understand its structure, and I have two questions:
1. It looks like a lot of UniProt accessions have been removed by the UniProt team. Comparing the chocophlan headers to the file of deleted accessions found here, it looks like ~6.3M of the ~54.5M chocophlan records have been deleted by UniProt. Are those retained intentionally, or should we exclude those hits from our analyses?
2. The chocophlan database has some extremely long records. In cases like
40993__A0A1R3RXX1__ASPCADRAFT_203126|k__Eukaryota.p__Ascomycota.c__Eurotiomycetes.o__Eurotiales.f__Aspergillaceae.g__Aspergillus.s__Aspergillus_carbonarius|UniRef90_A0A1R3RXX1, the chocophlan FASTA record has 199952 bases, but the UniProt entry shows the protein should only have 723 bases. What is all that extra content in the chocophlan DB?
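For anyone who wants to repeat the comparison in question 1, here is a minimal Python sketch of how the header above could be split and checked against a set of deleted accessions. It assumes the header layout is exactly what the example shows (`taxid__accession__gene|taxonomy|UniRef90_...`); the function names `parse_header` and `flag_deleted` are made up for illustration, and loading the UniProt deleted-accession file into a set is left out.

```python
def parse_header(header):
    """Split a chocophlan-style FASTA header into its visible parts.

    Assumes the layout seen in the example header:
    taxid__accession__gene|taxonomy|UniRef90_<id>
    """
    fields = header.lstrip(">").strip().split("|")
    # Only split on the first two "__" so gene names keep any later ones.
    taxid, accession, gene = fields[0].split("__", 2)
    uniref90 = next((f for f in fields if f.startswith("UniRef90_")), None)
    return {
        "taxid": taxid,
        "accession": accession,
        "gene": gene,
        "taxonomy": fields[1],
        "uniref90": uniref90,
    }


def flag_deleted(headers, deleted_accessions):
    """Return the headers whose UniProt accession is in the deleted set."""
    return [h for h in headers if parse_header(h)["accession"] in deleted_accessions]


if __name__ == "__main__":
    example = (
        "40993__A0A1R3RXX1__ASPCADRAFT_203126|k__Eukaryota.p__Ascomycota."
        "c__Eurotiomycetes.o__Eurotiales.f__Aspergillaceae.g__Aspergillus."
        "s__Aspergillus_carbonarius|UniRef90_A0A1R3RXX1"
    )
    print(parse_header(example)["accession"])
    # Placeholder set; in practice this would be read from the UniProt file.
    print(flag_deleted([example], {"NOT_A_REAL_ACCESSION"}))
```

Keeping the deleted accessions in a `set` makes each lookup constant time, which matters with ~54.5M records.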
Thanks in advance!
Re: 1 - We don’t exclude such protein IDs from analyses since they are still discoverable within the ecosystem of files that HUMAnN uses (since those files were built at a time before the IDs had been deleted). Some of them could be bad ORFs that were discovered later, in which case I’d be surprised if they contributed to higher-level functional categories (KOs, reactions, etc.). I think a lot of them are just cases where a new reference has been provided for what is essentially the same protein (e.g. because the underlying genome file was refined a bit), in which case they should still be mostly OK for mapping purposes.
Re: 2 - This one is more of a puzzle to me. One of the steps we have to do to link UniRef families to DNA sequences involves tracing back to the corresponding coordinates in the source genome. If those coordinates were off, we might’ve pulled a DNA sequence much longer than appropriate for the UniRef family in that genome. I’d be more inclined to exclude these from downstream analyses once you’ve found them, since their abundances won’t be reliable.
Thanks for your reply @franzosa !
Re 1: Anecdotally, in a test dataset of 8 replicates of the Zymo gut mock community, it looks like 7% of the overall abundance is accounted for by those bad accessions. What would it take for the humann database files to be rebuilt with a recent version of UniProt? A lot of these have been obsolete for many years.
Re 2: Attached is a list of some of the big accessions to hopefully help identify the issue: big_accessions.csv (103.1 KB). After removing the deleted accessions, I retained any exceeding 31421 bp (giant protein ebH). This only looks at the extra-big ones, though; it sounds like there could also be cases where a 2 kbp protein is cataloged in chocophlan as being 5 kbp, which this crude filter would miss.
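In case it helps reproduce the list, here is a rough Python sketch of the length filter described above, plus one possible way to catch the smaller mismatches. The 31421 bp cutoff is the one from this post; the ratio check is only an idea, and it assumes you have a lookup of expected protein lengths (in amino acids) keyed by FASTA header, which is a hypothetical input. The `max_ratio` value of 1.5 is an arbitrary illustration, not a recommended threshold.

```python
def fasta_lengths(lines):
    """Yield (header, sequence_length) for each record in an iterable of FASTA lines."""
    header, length = None, 0
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                yield header, length
            header, length = line[1:], 0
        elif header is not None:
            length += len(line)
    if header is not None:
        yield header, length


MAX_BP = 31421  # crude cutoff used in this post


def suspicious_records(lines, max_bp=MAX_BP, protein_len=None, max_ratio=1.5):
    """Yield (header, length, reason) for records that look too long.

    protein_len is a hypothetical dict of header -> expected protein length
    in amino acids; records missing from it only get the cutoff check.
    """
    protein_len = protein_len or {}
    for header, n_bases in fasta_lengths(lines):
        if n_bases > max_bp:
            yield header, n_bases, "over_cutoff"
            continue
        aa = protein_len.get(header)
        # Expected coding length is 3 bases per residue plus a stop codon.
        if aa and n_bases > max_ratio * 3 * (aa + 1):
            yield header, n_bases, "longer_than_expected"
```

Passing an open file handle as `lines` streams the FASTA record by record, so the whole database never has to fit in memory.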
Re: 1 - With energy currently directed toward v4 database development, I do not expect we’ll do another update along the v3 database branch unless it’s to fix a bug (which the ultra-large sequences - your point #2 - might be). If you are concerned about the obsolete entries, you can always replace them with UniRef50_unknown, which is how HUMAnN models sequences without valid UniRef90/50 assignments.
That said, I would only do this if you were worried about including an obsolete UniRef ID in something like a figure. I would still trust the IDs for the main way HUMAnN uses them, i.e. linking to other functional annotation systems. For example, if sequence X was homologous to (now obsolete) UniRef90_Y, and Y was annotated to Pfam domain Z, then seeing reads map to sequence X would likely still be informative for the presence of Z.
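If you do want to relabel obsolete IDs for display purposes, something along these lines could work. This is only a sketch, not a HUMAnN utility: it assumes the abundances have already been loaded into a dict keyed by feature name, that stratified features use the `family|taxon` form, and that `obsolete_ids` is a set you built yourself. The function name `relabel_obsolete` is made up for illustration.

```python
from collections import defaultdict

UNKNOWN = "UniRef50_unknown"


def relabel_obsolete(abundances, obsolete_ids):
    """Fold features with obsolete UniRef IDs into UniRef50_unknown.

    abundances: dict of feature name -> abundance, where stratified
    features look like "UniRef90_X|g__Genus.s__Species".
    """
    merged = defaultdict(float)
    for feature, value in abundances.items():
        # Keep any taxon stratification, only swap the family ID.
        family, sep, taxon = feature.partition("|")
        if family in obsolete_ids:
            family = UNKNOWN
        merged[family + sep + taxon] += value
    return dict(merged)


if __name__ == "__main__":
    table = {
        "UniRef90_Y": 2.0,
        "UniRef90_Y|g__A.s__A_b": 2.0,
        "UniRef90_K": 1.0,
        "UniRef50_unknown": 3.0,
    }
    print(relabel_obsolete(table, {"UniRef90_Y"}))
```

Summing into the existing unknown bucket keeps the sample totals unchanged, so only the labels differ from the original table.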
Re: 2 - Thanks for the list. We’ll look into this.