I was wondering how pathway abundances should be interpreted when some pathways are nested in others.
As an example, in an analysis I’m running I’ve got the following MetaCyc pathway abundances:
OANTIGEN-PWY … 1263.8116
DTDPRHAMSYN-PWY … 2243.2606
UDPNAGSYN-PWY … 962.1098
Now, OANTIGEN-PWY is a super-pathway composed of the other two, DTDPRHAMSYN-PWY and UDPNAGSYN-PWY. The documentation (and forum answers) recommend converting abundances into relative abundances (i.e. dividing by the sample total). But unlike gene abundances or species abundances, these pathways aren’t independent entities, as my example shows. Wouldn’t this kind of normalization distort comparisons between samples? Is there any specific recommendation for dealing with these cases (e.g. dropping super-pathways somehow)?
Many many thanks,
In practice normalizing the pathways in this way seems to work OK. The resulting fractions are still proportional to the pathways’ copy numbers in the original sample, and the normalization corrects for differences in sequencing depth across samples.
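To illustrate the point about proportionality, here is a minimal Python sketch of sum-normalization using the three values from the question. It assumes, purely for illustration, that these three pathways are the whole sample (a real table has many more rows), and the `sum_normalize` helper and the "twice as deep" sample are hypothetical.

```python
def sum_normalize(abundances):
    """Divide each pathway abundance by the sample total."""
    total = sum(abundances.values())
    return {name: value / total for name, value in abundances.items()}


# Values from the question; assumed here to be the entire sample.
sample = {
    "OANTIGEN-PWY": 1263.8116,
    "DTDPRHAMSYN-PWY": 2243.2606,
    "UDPNAGSYN-PWY": 962.1098,
}

# Hypothetical: the same community sequenced twice as deeply.
deeper = {name: 2 * value for name, value in sample.items()}

relab = sum_normalize(sample)
relab_deeper = sum_normalize(deeper)

for name in sample:
    print(f"{name}\t{relab[name]:.4f}\t{relab_deeper[name]:.4f}")
```

The two columns match, so the depth difference cancels, and the ratio between any two pathways within a sample is unchanged by the normalization, even though the super-pathway and its components share read mass.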
Another approach is to normalize the gene family abundances (where no read mass is double-counted) from RPKs to CPMs and then run the normalized genes back through HUMAnN to directly compute pathway abundance in CPMs. The resulting pathway abundances will NOT sum to 1M in that case, but they will be corrected for sequencing depth, and this avoids any artifacts that might arise from sum-normalizing over overlapping pathways. (Recomputing pathways by providing gene family abundances as an input file is very fast.)
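The RPK-to-CPM step in this second approach is just a rescaling of each sample's gene family abundances to sum to one million. A minimal sketch of the arithmetic follows, assuming made-up gene family names and values; the `rpk_to_cpm` helper is hypothetical and not part of HUMAnN.

```python
def rpk_to_cpm(rpk):
    """Rescale gene family RPKs so the sample sums to one million."""
    total = sum(rpk.values())
    return {name: value / total * 1_000_000 for name, value in rpk.items()}


# Hypothetical gene family abundances in RPK for one sample.
gene_rpk = {"geneA": 120.0, "geneB": 30.0, "geneC": 50.0}

gene_cpm = rpk_to_cpm(gene_rpk)

for name, value in gene_cpm.items():
    print(f"{name}\t{value:.1f}")
```

On real tables, HUMAnN's `humann_renorm_table` utility (with `--units cpm`) performs this conversion; the sketch only shows what the conversion does. Because each gene family is counted once, the total being divided by contains no double-counted read mass, unlike the pathway total.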