The bioBakery help forum

Metatranscriptomic data with MaAsLin2 & LEfSe

Hi all,

I’m currently running some metatranscriptomic data (without accompanying metagenomic data) through Humann2.
The study design is quite simple, disease state vs. healthy controls, and I would like to find out what taxons are differentially abundant (I understand that abundance is to be taken with a grain of salt in this case due to the nature of single copy marker genes), which gene families/pathways are differentially expressed, and look at any associations between taxonomy/function and clinical metadata.
My question is whether MaAsLin2 and LEfSe are suitable for statistical analysis on the Humann2 functional outputs for this type of data? I would normally use DESeq2 for RNA-Seq data if I was aligning to reference genome(s) but I do not think it would be suitable here as I will not have raw count data.

Any advice would be greatly appreciated,
All the best,
Calum

You could definitely use MaAsLin 2 for this. My suggestion would be to sum over genes within species from the HUMAnN gene families output to get an estimate of each species’ total RNA contributions. (Using MetaPhlAn to detect species from RNA in the first step of HUMAnN should still be fine, but the total gene RNA coverage will be a better estimate of RNA abundance than an average over marker RNA coverage, as MetaPhlAn reports.)

You could then sum-normalize those RNA abundances and directly analyze them to look for changes in species-level structure (noting that the changes will reflect a combination of species abundance and total expression).

If you analyze individual, stratified molecular functions, I would use the corresponding species’ total RNA as a covariate in your model. E.g. to compare gene1|species1 RNA values with a covariate of interest, include species1 total RNA abundance (as computed above) as an additional covariate. This will help to distinguish changes in expression of gene1|species1 specifically from changes in species1 abundance (which will affect all of its genes). In the absence of paired metagenomic data, we’ve found this approach to be much less prone to false positives than the alternative (i.e. not using the species covariate).

Thanks @franzosa for the advice.
I have finally come back to working on this dataset and it seems to be working well so far.
Can I please double check that the species covariate recommended for MaAsLin2 refers to the sum of the RPKs of that species as opposed to sum-normalised RPKs?
(i.e. not the sum-normalised RPKs used to look for changes in species-level structure)

Thanks again,
Calum

I would sum-normalize transcript RPKs to something else (e.g. CPMs) and then use those values both 1) as the dependent variables in the models and 2) for summing to get the source-species covariate. Otherwise you’d need to do something else to account for differences in sequencing depth between samples.

One other note on this based on some recent evaluations we’ve done: when running the model for transcript X, I recommend only including samples where X was non-zero (i.e. recruited at least one read). This helps to further weed out spurious associations driven by differences in encoding (e.g. lack of transcript driven by loss of gene).

Thanks again for the advice @franzosa.

I have another followup question if you please give it some thought.
I’m attempting to calculate Phylogenetic Diversity and Unifrac distances based on the species-level profile inferred from the gene family abundances.
I’m running into issues as the chocophlan database contains a combined
Streptococcus_mitis_oralis_pneumoniae pangenome while the metaphlan v20 db contains markers for each of those individual species (I’m using that database and the v20 phylogenetic tree from the CuratedMetagenomicData package for R).
My first though was to merge those leaves on the tree and recalculate their phylogenetic distance to every other Streptococcus species based on the average distance to those three leaves. Do you know of any software that would be able to do this (if it is even appropriate)?

Thanks,
Calum

The approach makes sense to me - I had a student follow a similar approach for computing distances between taxa at different ranks. I believe she used ETE Toolkit to do the actual computations:

http://etetoolkit.org/

Hi @franzosa,

I ended up using LEfSe to detect differentially abundant species and differentially expressed functions between the two disease states based on the HUMAnN 3.0 output.
To account for variations in species abundance before passing to LEfSe I normalised the CPM for the genefamily/pathway by dividing by the sum of that species’s CPM and multiplying by 1,000,000.
Does that sound appropriate?

All the best,
Calum