Hello biobakery help forum,
I have two questions about how pathway RPKs sum in the HUMAnN 3 output files.
1. I first looked at the individual raw pathabundance.tsv for each sample: the RPK value for each pathway does not equal the sum of its pathway|species rows. I also looked at the collapsed pathabundance_cpm.tsv, and that table has the same issue. The HUMAnN 3 command is a very straightforward one-liner with little we can modify, so I am not sure what could have gone wrong here. Please see the attached screenshot of this issue.
2. I used MaAsLin2 to find significant pathways. While trying to explain why a pathway is significant, I noticed that many species were unclassified in specific pathways. In fact, after filtering for pathways present in more than 10% of samples, all of the contribution came from unclassified species, which makes it hard to explain the result at the species level. Do you have suggestions on how I should interpret pathways that come from unclassified species?
Re: 1 - This is expected because the math for computing pathways is non-linear. Imagine a pathway with two reactions, A and B. If Species1 has A=5 and B=2, then we say Species1 has 2 “complete copies” of the pathway (driven by the weakest link, B). Conversely, if Species2 has A=2 and B=5, then it contributes 2 complete copies, limited by A. But at the community level A=7 and B=7, so there are 7 complete copies (ignoring boundaries between species). The output would then look like…
I.e. with community NOT being a sum over species. There are other non-linear choices in pathway abundance computation that also disrupt this additivity, but this is the big one. In contrast, all other abundances from HUMAnN (genes, reactions, domains, etc.) do sum as you would expect. Pathways are unusual in that we require the copies to be complete to be counted, and so (e.g.) doubling only the abundance of reaction A doesn’t double the pathway abundance.
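To make the arithmetic concrete, here is a minimal Python sketch of the “weakest link” idea using the toy numbers from the example above. It is only an illustration: the function names are made up for this post, and a plain minimum over reactions is a simplification, since HUMAnN's actual pathway computation includes other non-linear choices.

```python
# Toy reaction abundances per species (numbers from the example above).
species_reactions = {
    "Species1": {"A": 5, "B": 2},
    "Species2": {"A": 2, "B": 5},
}


def pathway_copies(reactions):
    """Simplified rule: complete copies = abundance of the weakest reaction."""
    return min(reactions.values())


def community_reactions(per_species):
    """Sum each reaction over all species (reactions themselves are additive)."""
    totals = {}
    for reactions in per_species.values():
        for name, abundance in reactions.items():
            totals[name] = totals.get(name, 0) + abundance
    return totals


# Stratified pathway abundance: apply the rule within each species.
stratified = {sp: pathway_copies(rx) for sp, rx in species_reactions.items()}

# Community pathway abundance: apply the rule to the pooled reactions.
community = pathway_copies(community_reactions(species_reactions))

print(stratified)                # {'Species1': 2, 'Species2': 2}
print(sum(stratified.values()))  # 4
print(community)                 # 7
```

The stratified rows add up to 4 while the community row is 7, which is the same kind of gap seen between a pathway total and the sum of its pathway|species rows.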
Re: 2 - Unclassified abundance is usually attributable to one or more microbial species that we aren’t catching for some reason (e.g. they are sufficiently diverged from the species that MetaPhlAn knows about and so their reads are mapped during translated search rather than pangenome search). Unclassified abundance can also result (in theory) from non-microbial contamination in the sample.
Hello @franzosa, thank you very much for your answer! So since the species level pathway is “more complete”, do you think it’s better to do MaAsLin2 based on species instead of the summed pathways? I am currently doing the summed pathway significance and then finding contributing species from species.
I usually follow your approach: i.e. do the testing at the community level and then, for pathways that are interesting/significant, comment on which species are causing them to change. It’s also fine to test at the species level, but that means doing more tests AND it tends to highlight “hitchhikers” (i.e. functions that are not mechanistically relevant but change because the species that encodes them is changing in abundance).
Yes, I agree. If I ran MaAsLin2 at the species level, I would run into many more multiple comparisons and get more false-positive pathway|species features. Thank you!
Hello Eric, so this is the same reason that the pathway relative abundances summed over the individual species do not equal 1, right? The community-level pathways do sum to 1.
That’s right - by default the normalization script forces the sum of the community totals to 1 and then normalizes the species stratifications against that total (since they are less likely to be complete following my explanation above, their total is smaller).
You can change this behavior if desired by setting the normalization mode to “levelwise”: this will normalize species stratifications over their own sum rather than the community sum.
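As a rough illustration of the difference between the two modes, here is a small Python sketch with made-up numbers. The pathway names, values, and the `normalize` helper are hypothetical, and this is not the normalization script's actual code, only the arithmetic described above.

```python
# Hypothetical community-level pathway totals and their species stratifications.
community_total = {"PWY-1": 7.0, "PWY-2": 3.0}
stratified = {
    "PWY-1|Species1": 2.0,
    "PWY-1|Species2": 2.0,
    "PWY-2|Species1": 1.0,
}


def normalize(values, denominator):
    """Divide every abundance by a shared denominator."""
    return {key: value / denominator for key, value in values.items()}


community_sum = sum(community_total.values())  # 10.0
stratified_sum = sum(stratified.values())      # 5.0

# Default behavior: both levels are divided by the community sum.
community_relab = normalize(community_total, community_sum)
strat_community_mode = normalize(stratified, community_sum)

# "levelwise": stratified rows are divided by their own sum instead.
strat_levelwise_mode = normalize(stratified, stratified_sum)

print(round(sum(community_relab.values()), 6))       # 1.0
print(round(sum(strat_community_mode.values()), 6))  # 0.5
print(round(sum(strat_levelwise_mode.values()), 6))  # 1.0
```

In the default mode the stratified rows sum to less than 1 because their total is smaller than the community total; in the levelwise mode they sum to 1, but they are no longer directly comparable to the community rows.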