I’m using the copy of chocophlan in my MT analysis pipeline, and I’m seeing an odd issue with the contents of chocophlan that came with HUMAnN3.
The issue is: there are multiple entries that use the same ID.
This ID is used 3 times, with 3 different sequences.
This produces a warning about duplicate IDs when I use samtools to parse my BWA run.
It was my understanding that we would get 1 unique ID per sequence. Is this no longer the case?
There should be one gene sequence per UniRef90 per species. We may have ended up with duplicates here because this is a species group (a merging of independently defined species pangenomes). Any reads assigned to any of those sequences would be grouped by HUMAnN into the appropriate UniRef families.
If the non-unique names are causing issues outside of HUMAnN, you could modify the sequence headers to contain something unique, e.g. a prefix indicating the position of the sequence in the file, like "000123-". Do not include this as a separate |-delimited field, since that would throw off HUMAnN’s indexing of other information in the header.
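A minimal sketch of that header rewrite in Python, in case it helps. It assumes an uncompressed FASTA file (for a gzipped file, swap `open` for `gzip.open(path, "rt")` and `gzip.open(path, "wt")`). The function names and the 6-digit padding are only illustrative and are not part of HUMAnN. The prefix is joined to the first header field with a hyphen, so the number of |-delimited fields stays the same.

```python
def prefix_fasta_headers(lines, width=6):
    """Yield FASTA lines, giving each header a unique positional prefix.

    Hypothetical helper: the counter is the 1-based position of the
    sequence in the file, zero-padded to `width` digits.
    """
    count = 0
    for line in lines:
        if line.startswith(">"):
            count += 1
            # Attach the prefix to the first field with '-' rather than
            # adding a new '|' field, so the field count is unchanged.
            yield f">{count:0{width}d}-{line[1:]}"
        else:
            # Sequence lines pass through untouched.
            yield line


def rewrite_fasta(in_path, out_path, width=6):
    """Stream a FASTA file to a new file with prefixed headers."""
    with open(in_path) as src, open(out_path, "w") as dst:
        dst.writelines(prefix_fasta_headers(src, width))
```

Calling `rewrite_fasta("input.ffn", "input.unique.ffn")` (placeholder file names) writes a copy in which every header is distinct, which should be enough for tools that require unique reference names. The aligner index would need to be rebuilt from the rewritten file.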
I’ll give this a try.
Thanks for the explanation, Eric!