RNA and DNA CPM as covariates

Hi,

I’d like to run a model from the MaAsLin2 preprint, but I have a few questions about how to arrange the data and metadata files to replicate it.

I have longitudinal, paired metatranscriptome and metagenome data from individual patients. For each RNA and DNA sample I computed read counts at the gene level and normalized them to TPM per sample. I’ve also grouped/stratified these TPM values per sample by functional pathway/taxonomy. Given that, I would essentially like to replicate this model described in the preprint:

Where “the log-transformed relative abundances of the whole-community and species-stratified RNA pathways are…modeled…while additionally adjusting the corresponding DNA pathways abundance as a continuous covariate to filter out the influence from gene copies”
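
If I’m reading that correctly, for each pathway $p$ and sample $j$ the model would be something like this (my own reading, not a formula copied from the preprint):

$$\log(\mathrm{RNA}_{pj}) = \beta_{0p} + \beta_{1p}\,\log(\mathrm{DNA}_{pj}) + \boldsymbol{\gamma}_p^{\top}\mathbf{x}_j + u_{p,s(j)} + \varepsilon_{pj}$$

where $\mathbf{x}_j$ holds the clinical covariates and $u_{p,s(j)}$ is a per-subject random intercept for the repeated samples per patient.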

Question 1:
I can’t figure out how to set up the input data and metadata files to accomplish this. Specifically, it’s unclear to me how both the RNA and DNA abundance values are incorporated into these files (the tutorial data set only uses one abundance value per feature), and what the appropriate way is to reference them in the metadata so that MaAsLin2 uses RNA as the “intercept”. Would you be able to share or describe template data/metadata files for this kind of RNA and DNA covariate analysis?

Question 2:
You say that when the above model is run on data grouped to the pathway level and/or stratified by taxonomy, “this considers a per-feature DNA covariate model, in which per-feature normalized transcript abundance is treated as a dependent variable, regressed on per-feature normalized DNA abundances along with other regressors in the model”. Is “per-feature normalized abundance” the same thing as just using TPM values (calculated per sample from the raw feature read counts in each sample, and summed/stratified appropriately to pathway/taxonomy levels per sample) as the feature abundance values?
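
In code terms, I’m imagining one regression per feature, along these lines (my own hypothetical sketch, not code from the preprint; `rna` and `dna` would be matching log-transformed feature-by-sample matrices, and `meta` a per-sample data frame with a hypothetical `diagnosis` column):

```r
# One model per feature: that feature's RNA abundance regressed on its
# own DNA abundance plus the clinical covariates.
results <- lapply(rownames(rna), function(p) {
  df  <- data.frame(rna = rna[p, ], dna = dna[p, ], meta)
  fit <- lm(rna ~ dna + diagnosis, data = df)
  coef(summary(fit))["dna", , drop = FALSE]  # DNA-adjusted coefficient row
})
```

Is that roughly the right mental model?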

Question 3:
I noticed that the results of the above model were not included in the preprint. Figure 5 shows the results of running MaAsLin2 using either RNA abundances or DNA abundances, but I couldn’t find the results of the model using RNA abundances with DNA abundance as a covariate. Should I take that to mean that running RNA and DNA separately and then comparing the results is more appropriate than using DNA as a covariate for transcription levels?

Thank you for any guidance you can provide!

Hi @okeydokey - thanks for your interest in our work. Great questions - I will go through them one by one, in reverse order:

Question 3:

Figure 5C is indeed based on the output of the above model where DNA is used as a covariate - apologies if it was not clear. The DNA pathways are analyzed similarly to the species profiles (i.e. adjusting for clinical metadata only) as described in the preprint.

Question 2:

We are directly using the HUMAnN 2 relative abundance values in RPK units, both species-stratified and at the community level. We additionally apply TSS normalization to account for the variable sequencing depth across samples. My best guess is that HUMAnN 2 is doing something similar to what you are describing for calculating normalized abundances at both levels, but I would double-check with the HUMAnN 2 developers (@franzosa @lauren.j.mciver) to make sure.
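
For clarity, the TSS step just rescales each sample’s RPK values to a fixed total. A minimal sketch in R, assuming `rpk` is a feature-by-sample matrix:

```r
# TSS normalization: scale each sample (column) to sum to 1e6,
# yielding generic "copies per million" units.
copm <- sweep(rpk, 2, colSums(rpk), "/") * 1e6
```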

Question 1:

As you have already figured out, without some nontrivial manipulation of the input data and code, it is currently not possible to apply this model using the MaAsLin 2 default pipeline. I am tagging @YancongZhang, who did this part of the analysis, and she can point you to the appropriate repository.

Please let us know if you have any other questions.

I can confirm that the TSS-normalized RPK values from HUMAnN are equivalent to TPM values. We avoid saying “TPM” since we are often quantifying at the DNA level (and the “T” in TPM refers to transcripts). HUMAnN’s term for this unit is CPM or CoPM, indicating generic Copies Per Million (not to be confused with Counts Per Million, which would indicate an absence of adjustment for sequence length).

Re. 1: Based on MaAsLin2, we built a companion package to handle this kind of DNA covariate analysis in metatranscriptomics. This new package keeps all the features of MaAsLin2 and adds additional parameters for DNA covariate modeling. We also provide a demo run. Please feel free to check out this repository: GitHub - biobakery/MTX_model: R package for differential expression analysis in metatranscriptomics
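
A minimal call looks roughly like the demo in the repository (simplified here, with placeholder file and covariate names; please check the demo for the exact interface):

```r
library(MTXmodel)  # install via devtools::install_github("biobakery/MTX_model")

# The interface mirrors MaAsLin2, with an extra argument pointing at the
# paired DNA table used as the per-feature continuous covariate.
fit <- MTXmodel(
  input_data     = "rna_pathabundance.tsv",  # RNA features x samples
  input_metadata = "metadata.tsv",           # per-sample clinical metadata
  output         = "mtx_output",
  transform      = "LOG",
  fixed_effects  = c("diagnosis"),           # placeholder covariate
  random_effects = c("subject"),             # repeated measures per patient
  input_dnadata  = "dna_pathabundance.tsv"   # paired DNA features x samples
)
```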

Thanks everyone, love your work.

@himel.mallick Thank you for the info and for clearing up Figure 5C - very helpful!

@franzosa Thank you for the input on the TPM/RPK values!

@YancongZhang This is exactly what I needed, thank you!

Hi @himel.mallick and @franzosa,

Thank you again for your very helpful responses! I have a follow up question now that I’m in the middle of using these tools.

I have my TSS-normalized RPK values as produced by HUMAnN 2 (which I’ll call CoPM here) for each feature at the RNA and DNA level, where each sample has a very similar total CoPM of ~1,000,000, as expected. When I then filter my data for quality control (so that only features found in “enough” samples at “appropriate” abundances remain), my samples end up with varying total CoPM based on this new subset of features, ranging from 450,000 to 850,000 CoPM across all samples.

When I feed this quality-controlled dataset into MaAsLin2, should I let it run TSS “again” on these CoPM data to address how different the total CoPM are across samples, or is there a more appropriate normalization method given that these are already TSS-normalized values?

Thanks for any guidance you have!

You don’t NEED to normalize again in this case - you want to do some sort of normalization once to adjust for differences in sequencing depth between samples, which you’ve accomplished with the TSS transformation to CoPM units.

If you renormalize, the meaning of the values changes to something more like “proportion of the things I care about” vs. “proportion of mapped reads” (doing so is not strictly wrong, it’s just philosophically different and perhaps unnecessary). I prefer to stop at the first normalization and then, if I filter out some features, replace them with a single per-sample “other” feature that absorbs the removed mass (such that the filtered samples still have a total abundance of 1M in this case).
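
Concretely, something like this sketch in R (assuming `copm` is your feature-by-sample matrix and `keep` is a logical vector flagging the features that pass your filter):

```r
# Drop the features that fail QC, but pool their abundance into a single
# per-sample "OTHER" row so each sample still totals ~1e6 CoPM.
filtered <- copm[keep, , drop = FALSE]
other    <- colSums(copm) - colSums(filtered)
filtered <- rbind(filtered, OTHER = other)
```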

Unrelated to the main question, filtration that removes >50% of the sample (as in your 450K CoPM case) seems pretty extreme. That’s what I’d expect for a sample that had e.g. a really high abundance of some sort of rare, technical contaminant.

Thanks for the quick reply, and again, for the incredibly helpful input.

Keeping a single per-sample “other” feature is a great idea, thanks!

Unrelated to the main question, filtration that removes >50% of the sample (as in your 450K CoPM case) seems pretty extreme. That’s what I’d expect for a sample that had e.g. a really high abundance of some sort of rare, technical contaminant.

Yeah, this kind of bothers me too. A qualitative look at the data doesn’t seem weird, though: the number of unique functional annotation features at the metagenome/transcriptome level ranges from ~3,000 to ~5,500 across all of my samples. If I consider only the features hit in “most” samples, and at both the RNA and DNA level, I have a list of ~2,500 features. Am I thinking about this right: when considering this ~2,500-feature list, the total CoPM varies so much between samples because the 5,000+ feature samples spread their CoPM out over more features, leaving a big chunk of TSS-normalized counts behind when you dump the ~2,000 features that were not consistently hit in most samples?

Based on the wet lab experiment, it doesn’t feel biologically appropriate for me to consider features that were not reliably seen in all of my samples, but I understand the bioinformatic benefits of pseudocounts for zeros in order to leave all/most of the features in. The reason I avoided that at first is that some other pipelines I’m feeding the CoPM data into penalize large feature lists during their default false discovery correction, so I figured I’d start by pruning the lists down and feed the exact same data into MaAsLin2 and my tangential pipelines. It didn’t seem like a feature list of ~2,500 was too strict, but perhaps a 2x swing in total CoPM means the filtering is too aggressive?