The bioBakery help forum

Metatranscriptomic data with MaAsLin2 & LEfSe

Hi all,

I’m currently running some metatranscriptomic data (without accompanying metagenomic data) through Humann2.
The study design is quite simple, disease state vs. healthy controls, and I would like to find out what taxons are differentially abundant (I understand that abundance is to be taken with a grain of salt in this case due to the nature of single copy marker genes), which gene families/pathways are differentially expressed, and look at any associations between taxonomy/function and clinical metadata.
My question is whether MaAsLin2 and LEfSe are suitable for statistical analysis on the Humann2 functional outputs for this type of data? I would normally use DESeq2 for RNA-Seq data if I was aligning to reference genome(s) but I do not think it would be suitable here as I will not have raw count data.

Any advice would be greatly appreciated,
All the best,

You could definitely use MaAsLin 2 for this. My suggestion would be to sum over genes within species from the HUMAnN gene families output to get an estimate of each species’ total RNA contributions. (Using MetaPhlAn to detect species from RNA in the first step of HUMAnN should still be fine, but the total gene RNA coverage will be a better estimate of RNA abundance than an average over marker RNA coverage, as MetaPhlAn reports.)

You could then sum-normalize those RNA abundances and directly analyze them to look for changes in species-level structure (noting that the changes will reflect a combination of species abundance and total expression).

If you analyze individual, stratified molecular functions, I would use the corresponding species’ total RNA as a covariate in your model. E.g. to compare gene1|species1 RNA values with a covariate of interest, include species1 total RNA abundance (as computed above) as an additional covariate. This will help to distinguish changes in expression of gene1|species1 specifically from changes in species1 abundance (which will affect all of its genes). In the absence of paired metagenomic data, we’ve found this approach to be much less prone to false positives than the alternative (i.e. not using the species covariate).

Thanks @franzosa for the advice.
I have finally come back to working on this dataset and it seems to be working well so far.
Can I please double check that the species covariate recommended for MaAsLin2 refers to the sum of the RPKs of that species as opposed to sum-normalised RPKs?
(i.e. not the sum-normalised RPKs used to look for changes in species-level structure)

Thanks again,

I would sum-normalize transcript RPKs to something else (e.g. CPMs) and then use those values both 1) as the dependent variables in the models and 2) for summing to get the source-species covariate. Otherwise you’d need to do something else to account for differences in sequencing depth between samples.

One other note on this based on some recent evaluations we’ve done: when running the model for transcript X, I recommend only including samples where X was non-zero (i.e. recruited at least one read). This helps to further weed out spurious associations driven by differences in encoding (e.g. lack of transcript driven by loss of gene).

Thanks again for the advice @franzosa.

I have another followup question if you please give it some thought.
I’m attempting to calculate Phylogenetic Diversity and Unifrac distances based on the species-level profile inferred from the gene family abundances.
I’m running into issues as the chocophlan database contains a combined
Streptococcus_mitis_oralis_pneumoniae pangenome while the metaphlan v20 db contains markers for each of those individual species (I’m using that database and the v20 phylogenetic tree from the CuratedMetagenomicData package for R).
My first though was to merge those leaves on the tree and recalculate their phylogenetic distance to every other Streptococcus species based on the average distance to those three leaves. Do you know of any software that would be able to do this (if it is even appropriate)?


The approach makes sense to me - I had a student follow a similar approach for computing distances between taxa at different ranks. I believe she used ETE Toolkit to do the actual computations:

Hi @franzosa,

I ended up using LEfSe to detect differentially abundant species and differentially expressed functions between the two disease states based on the HUMAnN 3.0 output.
To account for variations in species abundance before passing to LEfSe I normalised the CPM for the genefamily/pathway by dividing by the sum of that species’s CPM and multiplying by 1,000,000.
Does that sound appropriate?

All the best,

Hi @franzosa and @cazzlewazzle89,

I was wondering if there are any new developments to this?
I’m thinking of also analyzing my transcriptomics data with MaAsLin2 (or possibly LEfSe) and also had the normalization in mind that is mentioned in the last comment.

Thank you,

I would recommend doing these metatranscriptomics (MTX) analyses in MaAsLin so you can easily add covariates. If you’re working with RNA data only (no paired metagenomics/DNA), then our preferred approach is:

  • Sum-normalize transcripts to CPMs
  • Add up each species’ transcripts as an RNA-based abundance estimate
  • Sum-normalize transcripts within-species, sometimes called “within-taxon scaling”
  • Ignore samples where a given transcript was 0 (treat them as uninformative)
  • Model log(transcript) ~ phenotype + log(species abundance)

We’ll have a paper coming out at ISMB that compares this with other approaches and shows it to be the best option (of those we tried) when you don’t have paired DNA.

Thanks for the prompt answer!
Would you recommend a different approach if you have paired DNA?
(I should clarify that I’d like to look at expression of different gene families, not differential abundance of taxa.)


With paired DNA, a ratio model:

  • log(gene RNA / gene DNA) ~ phenotype

And a gene-DNA covariate model:

  • log(gene RNA) ~ phenotype + log(gene DNA)

Both performed well, with the latter being slightly more robust to zeroes. Filtering RNA zeros still seemed to help with these models, though RNA zeros paired with DNA non-zeros are somewhat more informative for reduced expression.

Also, a couple of clarifications about my previous post:

[1] The purpose of including the species abundance covariate is not because you are interested in the species abundance, but rather to help control for fluctuations in species-normalized transcript abundance that occur at low species abundance.

[2] I originally neglected to indicate that the transcripts should also be normalized within-species for that model (I will update the post above to correct this). This effectively transforms an MTX analysis into something like a stack of single-organism RNA-seq analyses.

Thanks for the clarifications, this is very helpful!

One more question:
To run the covariate model you suggest in MaAsLin2, I would provide both phenotype and log(gene DNA) as fixed effects?
And with log(gene DNA), do you mean overall DNA reads of that species in each sample? Or actually DNA per gene (but then I’m unsure how to add it as a fixed effect…)?

Thanks so much!

Yes to providing both as fixed effects. The DNA covariate would be the per-sample DNA abundance of the functional unit you’re modeling at the RNA level (be it a pathway, reaction, gene, etc.). As-is I think you’d need to run this one-at-a-time in MaAsLin - it’s on our to-do list to make running this for many functions more automated.

Ah I see, thanks a lot!
I thought it would have to be one-at-a-time, but wasn’t sure.