Part of this topic arises from the base question: how does taxonomic stratification using MetaPhlAn's results work in HUMAnN 3?
Using MetaPhlAn to stratify the functional profiles is mentioned in the user manual (GitHub - biobakery/humann), and stratified results are described both in the manual and in the tutorial (humann3 · biobakery/biobakery Wiki · GitHub). However, how this works in theory and in practice is much harder to discover.
I ask because each UniRef90 entry already has taxonomic information assigned, both for its representative and for the other members of the cluster. I don't quite understand why this information is not used, or why and how stratification with MetaPhlAn results is used instead.
As a result, every stratified function carries at least two sets of taxonomic information: one from UniRef and one from MetaPhlAn. These can disagree, and in practice they do. For example, I looked for the first instance of Escherichia and found UniRef90_A0A011NDA1|g__Escherichia.s__Escherichia_coli. The UniRef90 ID is associated with an organism from a different order (https://www.uniprot.org/uniref/UniRef90_A0A011NDA1; UniProt) than MetaPhlAn's stratum (UniProt). I assume MetaPhlAn's stratum is the less correct of the two, but the taxonomic information for the UniRef90 ID is not readily available from the default outputs. It is also unclear how this discrepancy affects the results when converting the UniRef90 IDs to something more tractable.
There’s some confusion here about MetaPhlAn’s role in the process: MetaPhlAn is being used to decide which functionally annotated pangenomes to align against in the second tier of HUMAnN’s search. For example, if MetaPhlAn found Species X, then we’d include the Species X pangenome in the next level of the search. The Species X pangenome includes genes annotated with UniRef IDs. So, for example, if the Species X pangenome includes UniRef90_ABC, you might see a row like:
UniRef90_ABC|s__Species_X
in your gene families output. In that case, the "Species X" taxonomy comes from the fact that the gene was found in the Species X pangenome, not from MetaPhlAn per se.
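As a rough illustration of how such a row breaks down, here is a minimal Python sketch that splits a stratified feature ID into its gene family and its taxon stratum. The function name `split_stratified_id` is hypothetical (it is not part of HUMAnN), and the sketch assumes only the `|` and `.` separators and the `g__`/`s__` prefixes seen in the IDs quoted in this thread.

```python
def split_stratified_id(feature_id):
    """Split 'UniRef90_X|g__Genus.s__Species' into (family, genus, species).

    Hypothetical helper, not part of HUMAnN. Rows without a '|' (and strata
    without g__/s__ prefixes) yield None for the missing ranks.
    """
    # Everything before the first '|' is the gene family (UniRef90 ID).
    family, sep, stratum = feature_id.partition("|")
    if not sep:
        return family, None, None

    genus = species = None
    # Assumes ranks in the stratum are joined by '.', as in the thread's example.
    for rank in stratum.split("."):
        if rank.startswith("g__"):
            genus = rank[3:]
        elif rank.startswith("s__"):
            species = rank[3:]
    return family, genus, species
```

Note that the taxon returned here is the pangenome the gene was found in, not the taxonomy of the UniRef90 representative sequence.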
As for the UniRef90 taxonomy, recall that UniRef90s are protein clusters, each representing multiple sequences with 90% amino acid identity and 80% coverage of a seed sequence. One such sequence, ABC, was chosen to represent the cluster based on its annotation, and it provides the ID for the family (UniRef90_ABC in this case). ABC itself might not derive from Species X, but some other member of the UniRef90_ABC cluster ought to. That said, in practice, the majority of UniRef90s are species-specific.
Ah, right, I do recall learning how MetaPhlAn is leveraged before (and even described it elsewhere), but I could not easily find it again (so largely user error, sorry!). However, I am not sure that this is clear, given that "g__Escherichia.s__Escherichia_coli" is a prototypical MetaPhlAn result while UniRef90_A0A011NDA1 is a prototypical HUMAnN result. Would you mind sharing a link to the documentation so I don't forget it (again) and post about not understanding it (again)?
However, I think a problem still exists: why are the lineages of a gene (cluster) and the species pangenome in which it is present so divergent? In this specific example, the gene cluster (UniRef ID) has 3 members, all of which are from the same family but a different order than the species pangenome (species ID), and this is for one of the most well-studied organisms in all of science. I understand that the UniRef IDs are just the functional annotation, but for something like E. coli, I would expect functions in the species pangenome to originate almost entirely from E. coli UniRef entries.
Given that the taxonomic information of the UniRef genes is effectively hidden, it is unclear how prevalent this phenomenon is and how it might affect the results.
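One way to get a rough sense of prevalence would be to compare each stratum against the taxonomy of the UniRef90 representative. The Python sketch below assumes you have already built a lookup from UniRef90 ID to the representative's genus yourself (for example from UniProt); both that `uniref_genus` dictionary and the function name `tally_genus_agreement` are hypothetical, and nothing here is provided by HUMAnN. It compares at genus level only, which is cruder than the order-level difference described above, and it checks only the representative, not the other cluster members.

```python
from collections import Counter


def tally_genus_agreement(feature_ids, uniref_genus):
    """Count how often a stratum's genus matches the UniRef90 representative's.

    feature_ids:  iterable of IDs like 'UniRef90_X|g__Genus.s__Species'
    uniref_genus: hypothetical user-built dict, UniRef90 ID -> genus name
    """
    counts = Counter()
    for feature_id in feature_ids:
        family, sep, stratum = feature_id.partition("|")
        # Skip unstratified rows and strata without a genus (e.g. 'unclassified').
        if not sep or not stratum.startswith("g__"):
            continue
        stratum_genus = stratum[3:].split(".", 1)[0]
        rep_genus = uniref_genus.get(family)
        if rep_genus is None:
            counts["unknown"] += 1  # no taxonomy available for this family
        elif rep_genus == stratum_genus:
            counts["agree"] += 1
        else:
            counts["disagree"] += 1
    return counts
```

A high "disagree" count would not by itself indicate an error, since a cluster's representative can come from a different organism than the pangenome in which another member of that cluster was found.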
Ultimately, then, this is not an issue with HUMAnN (or MetaPhlAn), but rather with the messiness of the MAGs/SGBs that are deposited to build the databases being used (perhaps with no better alternatives).