The bioBakery help forum

LEfSe input file question

Hello,

I created my input file from QIIME 2. However, it contains a few duplicate rows. For example, the two bacteria below differ at the genus level but belong to the same family, and I don’t have enough information to tell which genus each one is. Can I leave them as is, or do I need to specify (in some way) that they are in fact different at the genus level before running LEfSe?

Bacteria|Actinobacteria|Actinobacteria|Bifidobacteriales|Bifidobacteriaceae
Bacteria|Actinobacteria|Actinobacteria|Bifidobacteriales|Bifidobacteriaceae

Thanks,
Akriti

Hi Akriti -
In this case LEfSe won’t be able to differentiate between the two features. Imagine one of the genera is significant and the other not - there would be no information to tell which is the significant one. I’d suggest specifying different names (such as OTU cluster IDs) before running LEfSe.
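Before renaming, it can help to list which feature names are actually duplicated. Here is a minimal sketch in Python, assuming the feature names from the first column of the input file have already been read into a list. `duplicated_names` is a hypothetical helper, not part of LEfSe or QIIME 2, and the third name in the example is only there for illustration.

```python
from collections import Counter


def duplicated_names(names):
    """Return the feature names that occur more than once, in first-seen order."""
    counts = Counter(names)
    seen = set()
    dups = []
    for name in names:
        if counts[name] > 1 and name not in seen:
            seen.add(name)
            dups.append(name)
    return dups


# Illustrative feature names (first column of a LEfSe-style input table).
features = [
    "Bacteria|Actinobacteria|Actinobacteria|Bifidobacteriales|Bifidobacteriaceae",
    "Bacteria|Actinobacteria|Actinobacteria|Bifidobacteriales|Bifidobacteriaceae",
    "Bacteria|Firmicutes|Clostridia|Clostridiales|Veillonellaceae",
]
print(duplicated_names(features))
```

Any name this reports would need a distinguishing suffix before the file goes into LEfSe.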
Thanks!
Siyuan

Hi Siyuan,

Thanks for answering my question! I have two follow-up questions:

  1. If I use OTU cluster IDs then the LDA plot will not show the taxa. How can I create the LDA plot in this case?

  2. I followed instructions at this QIIME2 forum https://forum.qiime2.org/t/lefse-after-qiime2/4496/8 to create the input file at the L6 level. A colleague who used LEfSe after QIIME1 a number of years ago said that he had to summarize the features to create the LEfSe input file. However, in the link that I followed it looks like a relative frequency table will work just as well. Am I missing something here or has LEfSe’s ability to process input files changed over the years?

Thanks!
Akriti

Hi Akriti -
For your first question - you’re right, only OTU cluster IDs will show up in the LDA plot. To help with figure annotation, I’d suggest creating your own feature names by concatenating taxa names with OTU cluster IDs.
For your second question, I’m not sure what “summarize the features” meant in the QIIME1 case. The thread you posted does generate appropriate input for LEfSe, though.
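The concatenation could look something like the sketch below. It assumes each feature's taxonomy string and OTU cluster ID are known; `label_with_otu` and its `joiner` parameter are hypothetical names chosen for this example. The default joiner is "_" because "|" and "." are treated as taxonomy-level separators in feature names, so using them would add an extra level.

```python
def label_with_otu(taxonomy, otu_id, joiner="_"):
    """Build a unique feature name by joining a taxonomy string and an OTU ID.

    The joiner must not be '|' or '.', since those separate taxonomy levels
    and would turn the OTU ID into a level of its own.
    """
    if joiner in ("|", "."):
        raise ValueError("joiner must not be a taxonomy-level separator")
    return f"{taxonomy}{joiner}{otu_id}"


# Illustrative call with a lineage and an OTU ID.
name = label_with_otu(
    "Bacteria|Firmicutes|Clostridia|Clostridiales|Veillonellaceae|Dialister",
    "OTU_207",
)
print(name)
```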
Thanks!
Siyuan

Hi Siyuan,

My understanding is that the current input file is the feature table at the L6 level, where each row is a different bacterial genus and the columns hold the relative frequency of that genus for each subject.

After running QIIME1, my colleague created an input file for LEfSe in which every taxonomic level had its own summed row. For example, k__Bacteria|p__Actinobacteria|c__Actinobacteria was the sum of all features in the class Actinobacteria, and k__Bacteria was the sum of practically all features in the sample and was close to 1.

Do you know why the input files are different?

Thanks,
Akriti

Hi Siyuan,

I tried attaching a unique ID at the end (example below), but both Dialister and OTU 207 show up in the LDA plot (attached). Why is this happening?

Bacteria|Firmicutes|Clostridia|Clostridiales|Veillonellaceae|Anaerovibrio|OTU_206
Bacteria|Firmicutes|Clostridia|Clostridiales|Veillonellaceae|Dialister|OTU_207
Bacteria|Firmicutes|Clostridia|Clostridiales|Veillonellaceae|Megamonas|OTU_208

Thanks,
Akriti

Hi Akriti -
For your question on input feature levels: I believe LEfSe does not differentiate between the taxonomy levels of features. This means that if you’re interested in testing all feature levels, the old approach is more appropriate; if you’re only interested in genus-level results, your current approach is appropriate. Both should run in LEfSe.
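To make the difference between the two input styles concrete, here is a rough Python sketch of the "old" all-levels summary: every clade gets a row holding the sum of the features below it. It assumes the starting table contains only lowest-level rows (otherwise values would be counted twice). `summarize_all_levels` is a hypothetical helper for illustration, not a LEfSe or QIIME function, and the abundances are made up.

```python
def summarize_all_levels(table, sep="|"):
    """Add a summed row for every ancestor clade of every feature.

    table maps a full lineage string to a list of per-sample abundances.
    Assumes the input holds lowest-level rows only.
    """
    summed = {}
    for name, values in table.items():
        levels = name.split(sep)
        for depth in range(1, len(levels) + 1):
            clade = sep.join(levels[:depth])
            previous = summed.get(clade, [0.0] * len(values))
            summed[clade] = [a + b for a, b in zip(previous, values)]
    return summed


# Made-up relative abundances for two samples.
genus_table = {
    "k__Bacteria|p__Actinobacteria|c__Actinobacteria": [0.25, 0.125],
    "k__Bacteria|p__Firmicutes|c__Clostridia": [0.5, 0.75],
}
all_levels = summarize_all_levels(genus_table)
for clade, values in all_levels.items():
    print(clade, values)
```

With this layout LEfSe would test k__Bacteria, each phylum, and each class as separate features, whereas the single-level table tests only the rows it already contains.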
For your figure name question, I’d suggest the following debugging steps:

  1. Make sure that Dialister is not simply a feature in your input that didn’t get an OTU ID appended. By default, LEfSe truncates feature names to their lowest taxonomy level for plotting. You can change this behavior by setting --subclades to -1 in plot_res.py. This forces the script to plot the full feature names, so you can see whether the duplicates were the same or different features in the input.
  2. Also, LEfSe separates taxonomy levels in feature names by either the “|” or “.” symbol. So if you want both the genus and the OTU number in the figure, attach OTU IDs using a different character (“_”, for example) or, alternatively, set --subclades to 2 in plot_res.py.
  3. If these steps do not help, could you provide a minimal reproducible example, so that we could debug on our end? Feel free to mask away sensitive data when doing so.
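
The effect of the separator on plot labels in steps 1 and 2 can be pictured with the small Python sketch below. This is only an illustration of the behavior described above, not LEfSe's actual code: `plot_label` is a hypothetical function, `n_levels` stands in for the --subclades setting, and rejoining the kept parts with "|" is an arbitrary choice made for this example.

```python
import re


def plot_label(feature_name, n_levels=1):
    """Keep the last n_levels taxonomy levels of a name (-1 keeps all).

    Levels are split on '|' or '.', the two separators mentioned above.
    """
    parts = re.split(r"[|.]", feature_name)
    if n_levels == -1:
        return "|".join(parts)
    return "|".join(parts[-n_levels:])


lineage = "Bacteria|Firmicutes|Clostridia|Clostridiales|Veillonellaceae|Dialister"

# OTU ID attached with '|': it becomes its own lowest level.
print(plot_label(lineage + "|OTU_207"))
# OTU ID attached with '_': genus and OTU ID stay in one label.
print(plot_label(lineage + "_OTU_207"))
# Keeping two levels also shows genus and OTU ID together.
print(plot_label(lineage + "|OTU_207", n_levels=2))
```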

Thanks!
Siyuan

Hi Siyuan,

Thanks for clarifying! I don’t use Python, so I’m not sure how I would use the scripts you shared. I double-checked and found that all features got an ID. Also, I use the web version of LEfSe.

I tried to attach my input file in this post but got an error message. Can I email you instead? If yes, what is your email?

Thanks,
Akriti

Sure Akriti. You can reach me at siyuanma@g.harvard.edu

Hi @sma,

@asingh14 had a few additional questions.

  1. Do both work with LEfSe: taxa summarized as counts and as relative abundances?

  2. Is it OK not to add a trailing OTU ID at the end, since it shows up in the plot?

  3. For the file named Metaphlan, does it need to be based on relative abundance, or can it be counts too?

Thanks,
Sagun