Humann3 chocophlan duplicate IDs

Billy_Law · August 12, 2021, 8:51pm

Hi,
I’m using the copy of chocophlan in my MT analysis pipeline, and I’m seeing an odd issue with the contents of chocophlan that came with HUMAnN3.

The issue is that multiple entries share the same ID.
For example:
655183__B3WE58__rpmF|k__Bacteria.p__Firmicutes.c__Bacilli.o__Lactobacillales.f__Lactobacillaceae.g__Lactobacillus.s__Lactobacillus_casei_group|UniRef90_B3WE58|UniRef50_Q7C3P5|192

This ID is used 3 times, with 3 different sequences.
This produces a duplicate-ID warning when I use samtools to parse the output of my BWA run.

It was my understanding that we would get 1 unique ID per sequence. Is this no longer the case?
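
For anyone who wants to check how widespread the duplicates are, here is a minimal sketch that counts repeated IDs in a FASTA file. The function name `duplicate_fasta_ids` is made up for illustration, and it assumes the ID is everything after `>` up to the first whitespace, which is how most aligners name references.

```python
from collections import Counter


def duplicate_fasta_ids(lines):
    """Return {id: count} for FASTA IDs that occur more than once.

    Assumption: the ID is the header text after '>' up to the first
    whitespace character.
    """
    ids = Counter(
        line[1:].split()[0]
        for line in lines
        if line.startswith(">") and len(line.strip()) > 1
    )
    return {seq_id: n for seq_id, n in ids.items() if n > 1}
```

It takes any iterable of lines, so an open file handle works directly (use `gzip.open(path, "rt")` if the file is compressed).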

franzosa · August 16, 2021, 8:34pm

There should be one gene sequence per UniRef90 per species. We may have ended up with duplicates here because this is a species group (a merging of independently defined species pangenomes). Any reads assigned to any of those sequences would be grouped by HUMAnN into the appropriate UniRef families.

If the non-unique names are causing issues outside of HUMAnN, you could modify the sequence headers to contain something unique, e.g. a prefix indicating the position of the sequence in the file, like 000123-. Do not include this as a separate |-delimited field, since that would throw off HUMAnN’s indexing of other information in the header.
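
A rough sketch of that prefixing idea in Python, in case it helps. The function name `prefix_fasta_headers` and the six-digit width are assumptions for illustration, not anything HUMAnN provides. The counter is glued onto the first field with a hyphen, so the number of `|` separators in each header stays the same.

```python
def prefix_fasta_headers(lines, width=6):
    """Yield FASTA lines with a zero-padded running index on each header.

    The prefix (e.g. '000123-') is attached to the start of the existing
    first field, so no new '|'-delimited field is introduced.
    """
    index = 0
    for line in lines:
        if line.startswith(">"):
            index += 1
            # Keep the rest of the header, including its newline, unchanged.
            yield f">{index:0{width}d}-{line[1:]}"
        else:
            # Sequence lines pass through untouched.
            yield line
```

To use it, open the input and output files and write the generator out with `out_handle.writelines(prefix_fasta_headers(in_handle))`. Any index built from the original file would need to be rebuilt from the renamed one.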

Billy_Law · August 16, 2021, 8:47pm

I’ll give this a try.
Thanks for the explanation, Eric!
