The bioBakery help forum

Different UniRef90 IDs have the same nucleotide sequence in the ChocoPhlAn database

Hi all,
I’m trying to run HUMAnN 3.0 without any translated protein search.
Here is the command I use:

$ cat Sample1_R1.trimmed.fastq.gz Sample1_R2.trimmed.fastq.gz > Sample1_trimmed.fastq.gz
$ humann --input Sample1_trimmed.fastq.gz --output /shotgun_seq/humann3 --threads 24 --bypass-translated-search

Then I retrieved the sequences from the ChocoPhlAn database using the UniRef90 IDs in genefamilies.tsv.
However, some different UniRef90 IDs appear to have identical nucleotide sequences in the ChocoPhlAn database, for example UniRef90_A0A1A9P878 and UniRef90_A0A379QCR7.

Could you please tell me why these two gene families have different RPKs? Or did I do something wrong?


  • genefamilies.tsv
UniRef90_A0A1A9P878|g__Klebsiella.s__Klebsiella_pneumoniae      34.5649582837
UniRef90_A0A379QCR7|g__Klebsiella.s__Klebsiella_pneumoniae      40.5098326496
  • Sample1_trimmed_bowtie2_aligned.txt
$ grep 'UniRef90_A0A1A9P878\|UniRef90_A0A379QCR7' Sample1_trimmed_bowtie2_aligned.txt | grep 'g__Klebsiella.s__Klebsiella_pneumoniae' | cut -f 2 | sort | uniq -c

58 573__A0A1A9P878__recQ_2|k__Bacteria.p__Proteobacteria.c__Gammaproteobacteria.o__Enterobacterales.f__Enterobacteriaceae.g__Klebsiella.s__Klebsiella_pneumoniae|UniRef90_A0A1A9P878|UniRef50_A0A1A9P878|1827
68 573__A0A379QCR7__recQ_2|k__Bacteria.p__Proteobacteria.c__Gammaproteobacteria.o__Enterobacterales.f__Enterobacteriaceae.g__Klebsiella.s__Klebsiella_pneumoniae|UniRef90_A0A379QCR7|UniRef50_A0A2T1LC21|1827
  • g__Klebsiella.s__Klebsiella_pneumoniae.centroids.v296_201901.ffn.gz
>573__A0A1A9P878__recQ_2|k__Bacteria.p__Proteobacteria.c__Gammaproteobacteria.o__Enterobacterales.f__Enterobacteriaceae.g__Klebsiella.s__Klebsiella_pneumoniae|UniRef90_A0A1A9P878|UniRef50_A0A1A9P878|1827
GTGGCACAGGCGGAAGTATTAAATCAGGAATCGCTGGCTAAGCAGGTTTTACAAGAGACC
TTCGGCTACCAGCAGTTCCGTCCTGGCCAGGAAACGATTATCGAGACGGCGCTCGAAGGC
CGGGACTGCCTGGTGGTCATGCCGACCGGTGGCGGCAAGTCGCTGTGCTATCAGGTGCCG
GCGCTGGTCATGGGCGGTCTGACGGTCGTGGTCTCACCGCTGATCTCGCTGATGAAGGAC
CAGGTCGATCAGCTGCTGGCCAACGGCGTGGCGGCGGCTTGTCTGAACTCGACGCAAAGC
CGCGAGCAGCAGCAGGAGGTGATGGCCGGCTGCCGCAGCGGGCAGGTTCGTCTGCTGTAT
ATCGCGCCGGAACGGCTGATGCTGGATAACTTTCTTGAGCATCTGGCGAACTGGAACCTG
GCGATGCTGGCGGTAGACGAGGCGCACTGTATCTCGCAGTGGGGCCATGACTTCCGTCCG
GAATATGCCGCGCTGGGCCAGCTGCGTCAGCGGATGCCGCAGATCCCGTTTATGGCGTTG
ACCGCCACCGCCGATGATACCACCCGCCGCGATATCGTCCGCCTGCTGGGGCTTAACGAT
CCGCTGATTCAGGTCAGCAGCTTCGACCGGCCAAACATCCGCTATATGCTGATGGAGAAA
TTCAAGCCGCTCGATCAGCTGATGCGCTACGTTCAGGATCAGCGCGGCAAATCGGGCATT
ATCTACTGCAACAGCCGTTCGAAAGTGGAAGACACCGCCGCCAGGCTGCAAAGCCGCGGT
ATTAGCGCGGCGGCTTACCATGCCGGTCTGGAAAACGACGTGCGCGCCGAGGTGCAGGAG
AAATTCCAGCGCGACGATCTGCAGATCGTGGTGGCGACGGTGGCCTTCGGGATGGGCATT
AACAAGCCGAACGTCCGCTTTGTGGTGCATTTTGATATTCCGCGCAATATAGAATCCTAC
TATCAGGAGACCGGCCGCGCCGGGCGTGATGGTCTGCCGGCGGAAGCGATGCTGTTTTAC
GATCCGGCGGATATGGCGTGGCTGCGCCGCTGTCTGGAAGAAAAACCCGCCGGGCCGCTA
CAGGATATCGAACGGCATAAGCTGAATGCGATGGGGGCGTTTGCCGAAGCGCAGACCTGT
CGCCGTCTGGTGCTGCTGAACTATTTTGGCGAAGGGCGTCAGGAGCCGTGCGGCAACTGC
GATATCTGTCTTGACCCGCCAAAGCAGTACGATGGCTTAATGGACGCCCGCAAGGCGCTT
TCAACGATTTACCGGGTCAATCAACGCTTCGGAATGGGTTACGTGGTGGAGGTCCTGCGC
GGGGCCAACAACCAGCGCATCCGAGAGATGGGCCACGATAAGCTGCCGGTTTACGGTATC
GGCCGGGAGCAAAGTCACGAGCACTGGGTGAGCGTGATCCGCCAGCTGATCCACCTTGGG
CTGGTGACGCAGAATATCGCCCAGCACTCCGCGCTGCAGCTGACCGAAGCCGCGCGACCG
GTGCTGCGTGGCGAAGTGCCGCTGCAGCTCGCCGTGCCGCGTATCGTGGCGCTGAAGCCA
AAGGCGATGCAGAAATCCTTTGGCGGCAATTACGACCGTAAACTGTTCGCCAAGCTGCGC
AAATTACGTAAAGCGATCGCCGACGAAGAGAACATCCCGCCATATGTGGTCTTCAACGAC
GCGACGCTTATCGAGATGGCCGAACAATCGCCGCTGACCGCCGGCGAAATGCTCAGCGTC
AACGGCGTGGGGACACGCAAGCTCGAGCGTTTCGGGAAGCCGTTTATGGCGCTGATCCGG
GCGCATGTTGATGGCGACGATGAGTAG
>573__A0A379QCR7__recQ_2|k__Bacteria.p__Proteobacteria.c__Gammaproteobacteria.o__Enterobacterales.f__Enterobacteriaceae.g__Klebsiella.s__Klebsiella_pneumoniae|UniRef90_A0A379QCR7|UniRef50_A0A2T1LC21|1827
GTGGCACAGGCGGAAGTATTAAATCAGGAATCGCTGGCTAAGCAGGTTTTACAAGAGACC
TTCGGCTACCAGCAGTTCCGTCCTGGCCAGGAAACGATTATCGAGACGGCGCTCGAAGGC
CGGGACTGCCTGGTGGTCATGCCGACCGGTGGCGGCAAGTCGCTGTGCTATCAGGTGCCG
GCGCTGGTCATGGGCGGTCTGACGGTCGTGGTCTCACCGCTGATCTCGCTGATGAAGGAC
CAGGTCGATCAGCTGCTGGCCAACGGCGTGGCGGCGGCTTGTCTGAACTCGACGCAAAGC
CGCGAGCAGCAGCAGGAGGTGATGGCCGGCTGCCGCAGCGGGCAGGTTCGTCTGCTGTAT
ATCGCGCCGGAACGGCTGATGCTGGATAACTTTCTTGAGCATCTGGCGAACTGGAACCTG
GCGATGCTGGCGGTAGACGAGGCGCACTGTATCTCGCAGTGGGGCCATGACTTCCGTCCG
GAATATGCCGCGCTGGGCCAGCTGCGTCAGCGGATGCCGCAGATCCCGTTTATGGCGTTG
ACCGCCACCGCCGATGATACCACCCGCCGCGATATCGTCCGCCTGCTGGGGCTTAACGAT
CCGCTGATTCAGGTCAGCAGCTTCGACCGGCCAAACATCCGCTATATGCTGATGGAGAAA
TTCAAGCCGCTCGATCAGCTGATGCGCTACGTTCAGGATCAGCGCGGCAAATCGGGCATT
ATCTACTGCAACAGCCGTTCGAAAGTGGAAGACACCGCCGCCAGGCTGCAAAGCCGCGGT
ATTAGCGCGGCGGCTTACCATGCCGGTCTGGAAAACGACGTGCGCGCCGAGGTGCAGGAG
AAATTCCAGCGCGACGATCTGCAGATCGTGGTGGCGACGGTGGCCTTCGGGATGGGCATT
AACAAGCCGAACGTCCGCTTTGTGGTGCATTTTGATATTCCGCGCAATATAGAATCCTAC
TATCAGGAGACCGGCCGCGCCGGGCGTGATGGTCTGCCGGCGGAAGCGATGCTGTTTTAC
GATCCGGCGGATATGGCGTGGCTGCGCCGCTGTCTGGAAGAAAAACCCGCCGGGCCGCTA
CAGGATATCGAACGGCATAAGCTGAATGCGATGGGGGCGTTTGCCGAAGCGCAGACCTGT
CGCCGTCTGGTGCTGCTGAACTATTTTGGCGAAGGGCGTCAGGAGCCGTGCGGCAACTGC
GATATCTGTCTTGACCCGCCAAAGCAGTACGATGGCTTAATGGACGCCCGCAAGGCGCTT
TCAACGATTTACCGGGTCAATCAACGCTTCGGAATGGGTTACGTGGTGGAGGTCCTGCGC
GGGGCCAACAACCAGCGCATCCGAGAGATGGGCCACGATAAGCTGCCGGTTTACGGTATC
GGCCGGGAGCAAAGTCACGAGCACTGGGTGAGCGTGATCCGCCAGCTGATCCACCTTGGG
CTGGTGACGCAGAATATCGCCCAGCACTCCGCGCTGCAGCTGACCGAAGCCGCGCGACCG
GTGCTGCGTGGCGAAGTGCCGCTGCAGCTCGCCGTGCCGCGTATCGTGGCGCTGAAGCCA
AAGGCGATGCAGAAATCCTTTGGCGGCAATTACGACCGTAAACTGTTCGCCAAGCTGCGC
AAATTACGTAAAGCGATCGCCGACGAAGAGAACATCCCGCCATATGTGGTCTTCAACGAC
GCGACGCTTATCGAGATGGCCGAACAATCGCCGCTGACCGCCGGCGAAATGCTCAGCGTC
AACGGCGTGGGGACACGCAAGCTCGAGCGTTTCGGGAAGCCGTTTATGGCGCTGATCCGG
GCGCATGTTGATGGCGACGATGAGTAG

Sorry for missing this earlier - this is indeed very odd since the two UniRefs correspond to very different (albeit related) protein sequences. Is this an isolated incident or have you seen this happen for other pairs?

It’s not an isolated incident. Here is the list: same_nucleotide_seq.txt (5.2 KB)
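
For anyone who wants to check their own pangenome files, a list like this can be built by grouping the centroid records by sequence. Below is a minimal sketch, assuming the headers are `|`-delimited with a `UniRef90_...` field as in the records above. The function names are made up for illustration and are not part of HUMAnN.

```python
import gzip
from collections import defaultdict


def read_fasta(handle):
    """Yield (header, sequence) pairs from an open text handle."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def uniref90_of(header):
    """Return the UniRef90 field of a '|'-delimited header, or None."""
    for field in header.split("|"):
        if field.startswith("UniRef90_"):
            return field
    return None


def shared_sequences(records):
    """Map each sequence carried by more than one UniRef90 ID to those IDs."""
    ids_by_seq = defaultdict(set)
    for header, seq in records:
        uid = uniref90_of(header)
        if uid is not None:
            ids_by_seq[seq.upper()].add(uid)
    return {seq: sorted(ids) for seq, ids in ids_by_seq.items() if len(ids) > 1}


def scan_file(path):
    """Run the check on one gzipped FASTA file (e.g. a *.ffn.gz pangenome)."""
    with gzip.open(path, "rt") as handle:
        return shared_sequences(read_fasta(handle))
```

Calling `scan_file("g__Klebsiella.s__Klebsiella_pneumoniae.centroids.v296_201901.ffn.gz")` should return one entry per duplicated sequence, with the UniRef90 IDs that share it as the value.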

Thanks for the list - we are investigating this.

To better answer your initial question, the difference in RPKs probably comes down to bowtie2 breaking ties randomly when it finds two equivalently good hits. Using new options in HUMAnN 3.0 you can request more than the best hit from bowtie2 (in which case I’d expect to see any read that hits one of these sequences also hit the other). However, we found that the best hit approach tended to be more accurate (hence keeping it as the default).
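
To illustrate the tie-breaking effect, here is a toy simulation. It is only a sketch of the idea: it is not bowtie2's actual algorithm, and it assumes every read matches both references equally well. Each read is assigned to one of the two identical sequences at random, so the two counts will generally differ even though the sequences are the same.

```python
import random
from collections import Counter


def assign_reads(n_reads, references, seed=None):
    """Assign each read to one randomly chosen reference (toy tie-breaking)."""
    rng = random.Random(seed)
    return Counter(rng.choice(references) for _ in range(n_reads))


refs = ["UniRef90_A0A1A9P878", "UniRef90_A0A379QCR7"]

# 58 + 68 = 126 reads hit the two identical sequences in the original post.
counts = assign_reads(58 + 68, refs, seed=0)
for ref in refs:
    print(ref, counts[ref])
```

Changing the seed changes the split between the two references, but the total stays the same. That fits the 58 vs. 68 split above being tie-breaking noise.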

When regrouping UniRefs to higher level categories I’d expect the read mass assigned to either of these genes to be grouped into similar categories. For example, if you look at the raw annotations for the corresponding UniRefs you can see they are (near-)identical for things like ECs, Pfams, etc.

https://www.uniprot.org/uniprot/A0A1A9P878.txt
https://www.uniprot.org/uniprot/A0A379QCR7.txt
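
As a rough sketch of why the split should matter less after regrouping: when both UniRefs map to the same higher-level category, their RPKs are simply summed. The `GROUP_X` label and the `regroup` helper below are hypothetical placeholders, not real annotations or HUMAnN functions.

```python
from collections import defaultdict


def regroup(rpk_by_gene, group_of):
    """Sum per-gene RPKs into their groups; unmapped genes go to UNGROUPED."""
    totals = defaultdict(float)
    for gene, rpk in rpk_by_gene.items():
        totals[group_of.get(gene, "UNGROUPED")] += rpk
    return dict(totals)


# RPKs from the genefamilies.tsv excerpt in the original post.
rpk = {
    "UniRef90_A0A1A9P878": 34.5649582837,
    "UniRef90_A0A379QCR7": 40.5098326496,
}

# Hypothetical mapping: both UniRefs are assumed to share one category.
mapping = {gene: "GROUP_X" for gene in rpk}

print(regroup(rpk, mapping))
```

The group total is the same however the reads were split between the two UniRefs.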

Why the same gene ended up duplicated across two UniRefs is currently a mystery, though. I’ll get back to you when we figure out what happened.