I mistakenly asked this question under the MetaPhlAn topic but meant to ask it about HUMAnN, so I am reposting it below (sorry, I am new to this suite of tools and confused the names):
I recently joined a lab and was given two separate relative abundance output files from HUMAnN for two sets of samples (diseased vs. control) that were sequenced on the same run.
My question is: would it be more appropriate to rerun all of these samples through HUMAnN together, since they were sequenced on the same run and I need to compare them with each other? Or is it okay to just merge their pathway relative abundances and go forward with my statistical analyses?
My thought is to rerun all of these samples together through HUMAnN to get a more accurate picture of the pathways identified and their coverages across samples, since the coverage is relative and depends on which pathways are identified in each separate run through HUMAnN. I wanted to confirm by asking here. Thank you!
For future reference, I believe you can move your post from one “topic” to another. If that’s not something normal users can do, we moderators definitely can, so you can always ask for help moving the post.
As for your actual question, I’m not sure I understand what you are asking. HUMAnN always operates on one sample at a time, so if each case and control sample was analyzed separately, and the profiles were then merged into a single table, that’s the right way to do things (and you’re already done!).
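For anyone following along, here is a minimal sketch of what merging independently generated per-sample profiles into one table amounts to. It is not HUMAnN's own code: the function name, sample names, and pathway IDs are hypothetical, and filling a pathway missing from a sample with 0.0 is an assumption of this sketch.

```python
def merge_profiles(profiles):
    """Merge per-sample profiles into one pathway-by-sample table.

    `profiles` maps sample name -> {pathway: abundance}.
    A pathway absent from a sample is filled with 0.0 (assumption of this sketch).
    """
    # Union of all pathways seen in any sample, in a stable order.
    pathways = sorted({pathway for profile in profiles.values() for pathway in profile})
    return {
        pathway: {sample: profile.get(pathway, 0.0) for sample, profile in profiles.items()}
        for pathway in pathways
    }


# Hypothetical per-sample relative abundances, each profiled on its own.
per_sample = {
    "case_1": {"PWY_A": 0.6, "PWY_B": 0.4},
    "control_1": {"PWY_A": 0.5, "PWY_C": 0.5},
}

merged = merge_profiles(per_sample)
for pathway, row in merged.items():
    print(pathway, row)
```

Each sample's values are untouched by the merge, which is why rerunning the samples "together" would not change them.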
I tried to remove my original post but didn’t have permission, so I flagged it as the website instructed. As for my question, let me clarify it further:
One of my datasets (let’s call it A) has 14,742 total pathways identified (including general and species-specific pathways). My other dataset (B) has 7,565 pathways identified.
My thought is this: suppose A and B both share Pathway 1 (not species-specific) at similar coverages, but dataset A has more pathways observed overall than B. After the relative abundance calculation, Pathway 1 would then appear to have a lower coverage in A than in B, but that result could reflect the relative abundance calculation rather than the actual coverage of Pathway 1 in A vs. B.
That is why I am concerned about merging the pathway relative abundances for A and B and then analyzing them, rather than rerunning these datasets through HUMAnN together.
I hope this clarifies the question a bit further! Thank you for your help!
Thanks for the clarification! I see what you are saying now. Because HUMAnN analyzes each sample independently (including normalization), I don’t think any aspect of rerunning HUMAnN will change this. Samples in dataset A have more pathways than B, so indeed (all else being equal) relative abundance normalization will tend to depress the abundances of A pathways and inflate the abundances of B pathways. This is a general property of compositional statistics and not necessarily wrong philosophically: there are more unique “things” in dataset A so proportionally each “thing” has lower relative abundance. It’s an important point to keep in mind when interpreting results, however: pathway X increasing in B samples vs. A samples doesn’t necessarily mean it is more advantageous in B.
One approach to this “problem” is so-called genome-size normalization, as implemented by MUSiCC and MicrobiomeCensus. With these methods, instead of normalizing functions’ abundances to the sum over all abundances, you specifically normalize to the average abundance of a set of universal single-copy marker genes. Abundances are then expressed as something more like “% of cells in the community that have the function,” which is less vulnerable to compositional effects.
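A rough sketch of the idea described above, assuming you already have per-sample abundances that include a set of universal single-copy marker genes. This is not the MUSiCC or MicrobiomeCensus implementation; the function name, the feature names, and the marker set are hypothetical, and the plain mean is used only because the description above says "average".

```python
from statistics import mean


def normalize_to_markers(profile, marker_ids):
    """Divide every abundance by the mean abundance of the marker genes.

    `profile` maps feature -> abundance; `marker_ids` is the (hypothetical)
    set of universal single-copy marker genes present in the profile.
    """
    marker_values = [profile[m] for m in marker_ids if m in profile]
    if not marker_values:
        raise ValueError("no marker genes found in profile")
    scale = mean(marker_values)
    if scale == 0:
        raise ValueError("marker genes have zero abundance")
    return {feature: value / scale for feature, value in profile.items()}


markers = {"marker_1", "marker_2"}  # hypothetical marker set

# function_X is equally abundant per cell in both samples;
# sample A simply carries more other functions than sample B.
sample_a = {"marker_1": 10.0, "marker_2": 10.0, "function_X": 5.0,
            "function_Y": 10.0, "function_Z": 10.0}
sample_b = {"marker_1": 10.0, "marker_2": 10.0, "function_X": 5.0}

norm_a = normalize_to_markers(sample_a, markers)
norm_b = normalize_to_markers(sample_b, markers)
print(norm_a["function_X"], norm_b["function_X"])
```

In this toy input, function_X gets the same marker-normalized value in both samples, whereas dividing by each sample's total would give it a smaller share in A than in B.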
This helps, thank you so much for this explanation! I am now a bit more curious about the conclusions we can make based on the results of pathway X in B samples vs A samples.
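When you say “pathway X increasing in B samples vs. A samples doesn’t necessarily mean it is more advantageous in B,” could you please elaborate?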
Are you saying that you can’t make that conclusion based on the differences in raw relative abundances of pathway X between B samples vs A samples?
My perspective is this: if pathway X has a significantly greater coverage in B samples than in A samples, that could indicate it’s more prevalent (either in some taxa that have multiple copies of genes in pathway X, or because pathway X is well distributed across the microbiome), which suggests it is a useful survival mechanism for the microbes in the B samples.
I am just curious as to your thoughts based on your reply. Thank you so much again for your time and thoughtful response!
If you’re talking about relative abundance, if Pathway X increases in B samples vs. A samples, it is hard to know if X is increasing in B or if so many other things are lost in B that X is just expanding in relative abundance space. Consider:
A = {X, Y, Z}
B = {X}
X is a constant there, but its relative abundance is 3x higher in B due to the loss of Y and Z!
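The toy example above can be worked through numerically. This is only a sketch: the absolute abundances (10 units each) are made up for illustration, and the function name is hypothetical.

```python
def relative_abundance(profile):
    """Divide each abundance by the sample total so the values sum to 1."""
    total = sum(profile.values())
    if total == 0:
        raise ValueError("profile has zero total abundance")
    return {name: value / total for name, value in profile.items()}


# Assumed absolute abundances: X is identical in both samples.
sample_a = {"X": 10.0, "Y": 10.0, "Z": 10.0}
sample_b = {"X": 10.0}

rel_a = relative_abundance(sample_a)
rel_b = relative_abundance(sample_b)

# X did not change in absolute terms, yet its share of the total did.
fold_change = rel_b["X"] / rel_a["X"]
print("X in A:", rel_a["X"])
print("X in B:", rel_b["X"])
print("fold change (B / A):", fold_change)
```

X holds one third of the total in A and all of it in B, so the threefold difference comes entirely from Y and Z dropping out.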
These effects aren’t quite as marked in prevalence (presence/absence) unless the function you’re talking about is very rare, in which case it might go from “undetectable” to “detectable” in A vs. B as other functions are lost and its abundance expands (i.e. from below to above the detection limit). These effects also aren’t as marked if A vs. B have similar alpha diversity. We mainly worry about these effects when comparing environments or conditions where there is a marked difference in species-level diversity.
Thank you for clarifying. So if I wanted to compare the functional beta diversity of A and B, for example to compare a set of specific pathways between them, would a presence/absence metric like Jaccard distance be more appropriate?
I’ve seen several papers apply an arcsine square-root transform to these pathway relative abundances before downstream statistical analyses, but I would appreciate your insight whenever you have a moment.
Again, thank you so much for your generosity, time, and patience in answering these inquiries!
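For reference, the two options raised in this question can be sketched as follows. This only shows what each computation does and does not say which is preferable. The function names and the `detection_threshold` parameter are hypothetical, and the transform assumes the abundances are proportions between 0 and 1 (rescale first if your table uses other units).

```python
import math


def arcsine_sqrt(proportion):
    """Arcsine square-root transform of a proportion in [0, 1]."""
    if not 0.0 <= proportion <= 1.0:
        raise ValueError("expected a proportion between 0 and 1")
    return math.asin(math.sqrt(proportion))


def jaccard_distance(profile_a, profile_b, detection_threshold=0.0):
    """Presence/absence Jaccard distance between two profiles.

    A pathway counts as present when its abundance exceeds
    `detection_threshold` (a hypothetical parameter of this sketch).
    """
    present_a = {name for name, value in profile_a.items() if value > detection_threshold}
    present_b = {name for name, value in profile_b.items() if value > detection_threshold}
    union = present_a | present_b
    if not union:
        return 0.0  # two empty profiles are treated as identical
    return 1.0 - len(present_a & present_b) / len(union)


# Relative abundances from the toy example: A = {X, Y, Z}, B = {X}.
rel_a = {"X": 1.0 / 3.0, "Y": 1.0 / 3.0, "Z": 1.0 / 3.0}
rel_b = {"X": 1.0}

transformed_a = {name: arcsine_sqrt(value) for name, value in rel_a.items()}
transformed_b = {name: arcsine_sqrt(value) for name, value in rel_b.items()}
distance = jaccard_distance(rel_a, rel_b)

print(transformed_a["X"], transformed_b["X"])
print(distance)
```

Note that the transform is applied to values that are already relative abundances, while the Jaccard distance only uses which pathways are detected in each sample.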