LEfSe input file question

Hello,

I created my input file from QIIME 2. However, it contains a few duplicate rows. For example, the two bacteria below differ at the genus level but belong to the same family, and I don’t have enough information to tell which genus each one is. Can I leave the rows as they are, or do I need to mark them (in some way) as different at the genus level before running LEfSe?

Bacteria|Actinobacteria|Actinobacteria|Bifidobacteriales|Bifidobacteriaceae
Bacteria|Actinobacteria|Actinobacteria|Bifidobacteriales|Bifidobacteriaceae

Thanks,
Akriti

Hi Akriti -
In this case LEfSe won’t be able to differentiate between the two features. Imagine one of the genera is significant and the other not - there would be no information to tell which is the significant one. I’d suggest specifying different names (such as OTU cluster IDs) before running LEfSe.
Thanks!
Siyuan
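
The renaming suggested above can be sketched in a few lines of Python. This is only an illustration and not part of LEfSe: `make_unique` is a hypothetical helper, and the running index it appends is a stand-in for real OTU cluster IDs. The suffix is joined with "_" so that it stays part of the last taxonomic level instead of forming a new one.

```python
from collections import Counter


def make_unique(names, sep="_"):
    """Append a running index to every name that occurs more than once.

    Hypothetical helper for illustration; the index stands in for an OTU ID.
    """
    totals = Counter(names)   # how often each name occurs overall
    seen = Counter()          # how often each name has been seen so far
    unique = []
    for name in names:
        if totals[name] > 1:
            seen[name] += 1
            unique.append(f"{name}{sep}{seen[name]}")
        else:
            unique.append(name)
    return unique


features = [
    "Bacteria|Actinobacteria|Actinobacteria|Bifidobacteriales|Bifidobacteriaceae",
    "Bacteria|Actinobacteria|Actinobacteria|Bifidobacteriales|Bifidobacteriaceae",
    "Bacteria|Firmicutes|Clostridia|Clostridiales|Veillonellaceae|Dialister",
]
unique_features = make_unique(features)
# The two Bifidobacteriaceae rows now end in "_1" and "_2";
# the Dialister row is unchanged because it was already unique.
```

The sketch does not check whether a generated name collides with a name already present in the table, so a real script would want that check as well.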

Hi Siyuan,

Thanks for answering my question! I have two follow-up questions:

  1. If I use OTU cluster IDs then the LDA plot will not show the taxa. How can I create the LDA plot in this case?

  2. I followed instructions at this QIIME2 forum https://forum.qiime2.org/t/lefse-after-qiime2/4496/8 to create the input file at the L6 level. A colleague who used LEfSe after QIIME1 a number of years ago said that he had to summarize the features to create the LEfSe input file. However, in the link that I followed it looks like a relative frequency table will work just as well. Am I missing something here or has LEfSe’s ability to process input files changed over the years?

Thanks!
Akriti

Hi Akriti -
For your first question - you’re right, only OTU cluster IDs will show up in the LDA plot. To make the figure easier to annotate, I’d suggest creating your own feature names by concatenating taxa names with OTU cluster IDs.
For your second question, I’m not sure what “summarize the features” meant in the QIIME1 case. The thread you linked does generate appropriate input for LEfSe, though.
Thanks!
Siyuan

Hi Siyuan,

My understanding is that the current input file is the feature table at the L6 level where each row is a different bacterial species and the columns are the relative frequency of that bacterial species for each subject.

After running QIIME1, my colleague created a LEfSe input file in which each clade had its own summed row. For example, k__Bacteria|p__Actinobacteria|c__Actinobacteria was the sum of all features in the class Actinobacteria, and k__Bacteria was the sum of practically all features in the sample and was close to 1.

Do you know why the input files are different?

Thanks,
Akriti
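
The difference between the two input files can be made concrete with a small Python sketch. It is not the QIIME1 or LEfSe implementation: `summarize_clades` is a hypothetical helper, and the abundances are made up for illustration. It takes a genus-level table (the first kind of input) and adds one summed row for every ancestor clade (the second kind).

```python
from collections import defaultdict


def summarize_clades(table, sep="|"):
    """Sum each feature's per-sample values into the feature itself
    and into every one of its ancestor clades.

    table: dict mapping a full lineage string to a list of per-sample values.
    """
    if not table:
        return {}
    n_samples = len(next(iter(table.values())))
    summed = defaultdict(lambda: [0.0] * n_samples)
    for name, values in table.items():
        levels = name.split(sep)
        for depth in range(1, len(levels) + 1):
            clade = sep.join(levels[:depth])   # lineage truncated at this depth
            for i, value in enumerate(values):
                summed[clade][i] += value
    return dict(summed)


# Made-up relative abundances for two samples (illustrative only).
veillonellaceae = "k__Bacteria|p__Firmicutes|c__Clostridia|o__Clostridiales|f__Veillonellaceae"
genus_table = {
    veillonellaceae + "|g__Dialister": [0.25, 0.5],
    veillonellaceae + "|g__Megamonas": [0.25, 0.0],
    "k__Bacteria|p__Actinobacteria|c__Actinobacteria|o__Bifidobacteriales"
    "|f__Bifidobacteriaceae|g__Bifidobacterium": [0.5, 0.5],
}
clade_table = summarize_clades(genus_table)
# clade_table["k__Bacteria"] is [1.0, 1.0]: the kingdom row sums every feature.
# clade_table[veillonellaceae] is [0.5, 0.5]: the family row sums its two genera.
```

The genus-level table has 3 rows, while the summarized table has 12, one per clade at every level.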

Hi Siyuan,

I tried attaching a unique ID at the end (example below), but both Dialister and OTU 207 show up in the LDA plot (attached). Why is this happening?

Bacteria|Firmicutes|Clostridia|Clostridiales|Veillonellaceae|Anaerovibrio|OTU_206
Bacteria|Firmicutes|Clostridia|Clostridiales|Veillonellaceae|Dialister|OTU_207
Bacteria|Firmicutes|Clostridia|Clostridiales|Veillonellaceae|Megamonas|OTU_208

Thanks,
Akriti

Hi Akriti -
For your question on input feature levels: I believe LEfSe does not differentiate between the taxonomic levels of features. This means that if you’re interested in testing all taxonomic levels, the old approach is more appropriate; if you’re only interested in genus-level results, your current approach is appropriate. Both should run in LEfSe.
For your figure name question, I’d suggest the following debugging steps:

  1. Make sure that Dialister is not simply a feature in your input that did not get an OTU ID appended. By default, LEfSe truncates feature names to their lowest taxonomic level for plotting. You can change this behavior by setting --subclades to -1 in plot_res.py. This forces the script to plot the full feature names, so you can see whether the duplicates were the same or different features in the input.
  2. Also, LEfSe separates taxonomic levels in feature names by either the “|” or the “.” symbol. So if you want both the genus and the OTU number in the figure, attach the OTU IDs with a different character ("_", for example) or, alternatively, set --subclades to 2 in plot_res.py.
  3. If these steps do not help, could you provide a minimal reproducible example, so that we could debug on our end? Feel free to mask away sensitive data when doing so.

Thanks!
Siyuan
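
The labelling behavior described in steps 1 and 2 can be mimicked with a short Python sketch. This is not LEfSe's actual plotting code: `plot_label` is a hypothetical function that only reproduces the rules stated above (split on “|” or “.”, keep the last `subclades` levels, keep everything for -1). Joining the kept levels with "." is an arbitrary choice made for the sketch.

```python
import re


def plot_label(feature_name, subclades=1):
    """Return the label a plot would show under the rules described above.

    Hypothetical re-implementation for illustration only.
    """
    levels = re.split(r"[|.]", feature_name)   # both symbols separate levels
    if subclades < 1:
        # -1 means "all levels"; treating 0 the same way is an assumption.
        return ".".join(levels)
    return ".".join(levels[-subclades:])


family = "Bacteria|Firmicutes|Clostridia|Clostridiales|Veillonellaceae"

# OTU ID attached with "|": it becomes its own (lowest) level.
with_pipe = family + "|Dialister|OTU_207"
label_default = plot_label(with_pipe)             # "OTU_207"
label_two = plot_label(with_pipe, subclades=2)    # "Dialister.OTU_207"

# OTU ID attached with "_": it stays part of the genus level.
with_underscore = family + "|Dialister_OTU_207"
label_joined = plot_label(with_underscore)        # "Dialister_OTU_207"
```

Under these rules, a label that reads only “Dialister” would come from a feature whose last level is exactly “Dialister”, which is what step 1 asks you to check for.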

Hi Siyuan,

Thanks for clarifying! I don’t use Python, so I’m not sure how I would use the scripts you mentioned. I double-checked and found that all features got an ID. Also, I use the web version of LEfSe.

I tried to attach my input file in this post but got an error message. Can I email you instead? If yes, what is your email?

Thanks,
Akriti

Sure Akriti. You can reach me at siyuanma@g.harvard.edu

Hi @sma,

@asingh14 had a few additional questions.

  1. Do both kinds of input work with LEfSe, i.e., taxa summarized as counts and taxa summarized as relative abundances?

  2. Is it OK not to add a trailing OTU ID at the end, since it shows up in the plot?

  3. For the file named Metaphlan, does it need to be based on relative abundance, or can it be counts too?

Thanks,
Sagun

Hello and thank you for the package,

I am wondering if there was a final resolution to this post? I am struggling with a similar situation and wondering if I should post all the details or if a solution is known?

Thanks
-Amber

Hi @damselflywingz,

Can you give us just a little more information on what your issue is? There are several addressed in the above thread.

Best,
Kelsey

Hi @Kelsey_Thompson,

Okay, since I first posted I have found a solution, so I thought I would post my code in case others want to know how to remove the LEfSe tests that are run on higher taxonomic classifications. I work with ASVs, and I wanted to run LEfSe tests on individual ASVs labelled with their lowest possible taxonomic assignment. I hope this helps; I appreciate the tool and that it is available to the community :smiley:

I also wanted to follow up and ask whether you think anything about this approach to analyzing my microbiome data is incorrect, and whether you could explain the rationale for the package including analysis at higher-order taxonomic levels?

Thanks,
-Amber

############################ LDA and LEfSe analysis
#to remove the higher taxonomic rankings in the LDA, reduce the tax_table to only a single column that includes the lowest taxonomic assignment at genus-level concatenated to the ASV identifier
#note you may need to modify your tax_table for lowest taxonomy (see: Rename "NA" to "unspecified + [last identified taxa]" · Issue #850 · joey711/phyloseq · GitHub)

library(phyloseq)
library(ggplot2)

#extract the taxonomy table from the ps object
tax.out = data.frame(tax_table(ps))

#add the ASV identifier (row name) as a column and paste it onto the genus name
tax.out3 = tax.out
names = rownames(tax.out3)
tax.out3[,"ASV"] = names
tax.out3$Species_new = paste(tax.out3$Genus, tax.out3$ASV, sep="_")

#drop the old columns so that only the combined genus_ASV label remains, renamed to "Genus"
new_tax.out3 = dplyr::select(tax.out3, -Species, -ASV)
new_tax.out4 = dplyr::select(new_tax.out3, -Genus)
colnames(new_tax.out4)[6] <- "Genus"
new_tax.out5 = dplyr::select(new_tax.out4, -Kingdom, -Phylum, -Class, -Order, -Family)

#create a new ps object to replace with the reduced taxonomy table
ps2 = ps
tax_table(ps2) <- as.matrix(new_tax.out5)

#Linear discriminant analysis (LDA) effect size (LEfSe)

library(microbiomeMarker)

#lefse analysis returns a microbiome biomarker stored in marker_table-class
mm3 <- lefse(ps2, norm="none", class="Location1", lda_cutoff=2, multicls_strat=TRUE)

#plot (myCol is a user-defined vector of fill colors that is not defined in this snippet)
plot_ef_bar(mm3, label_level=1, max_label_len = 100)+
scale_fill_manual(values=myCol)

#####end

Hi @damselflywingz,

Glad you were able to find an answer! Basically, LEfSe splits taxa names into taxonomic levels wherever a “|” or a “.” separates them. So you can either clean the names as you do above or change all of the “|” characters to a different character, such as “_”.

Thanks for posting your code for others!

Best,
Kelsey

Hello,
I am reopening this topic because I ran two tests with two different input files, and I would like to understand why I don’t get the same results.

Let me explain:
1st matrix: an abundance matrix of bacterial genera in which the same genus can appear in several rows, because I simply replaced each ASV ID in my ASV column with the genus of that ASV.

2nd matrix: an abundance matrix of bacterial genera with no repeated genus, because I summed the abundances by genus.

Does LEfSe group identical names together before doing the analyses? If it did, shouldn’t I get the same results?

Thank you for your help,
Best,
Megane.
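
For readers comparing the two matrices, the Python sketch below shows how the second one relates to the first. It is only an illustration with made-up abundances, and `sum_by_name` is a hypothetical helper. It does not show what LEfSe does internally with identically named rows, which this thread does not settle; it only shows that the two inputs contain different features.

```python
def sum_by_name(rows):
    """Collapse rows that share a name by summing their per-sample values.

    rows: list of (name, values) pairs; repeated names are allowed.
    """
    totals = {}
    for name, values in rows:
        if name in totals:
            totals[name] = [a + b for a, b in zip(totals[name], values)]
        else:
            totals[name] = list(values)
    return totals


# "1st matrix": one row per ASV, labelled by genus (made-up values, 3 samples).
per_asv = [
    ("Dialister", [0.125, 0.0, 0.25]),
    ("Dialister", [0.0, 0.5, 0.25]),
    ("Megamonas", [0.25, 0.25, 0.0]),
]

# "2nd matrix": one row per genus, abundances summed.
per_genus = sum_by_name(per_asv)
# per_genus["Dialister"] is [0.125, 0.5, 0.5], a profile that matches
# neither of the two original Dialister rows.
```

Because the summed Dialister row has a different profile from either ASV-level row, a per-feature test applied to each matrix is looking at different numbers.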