Hello, I found this post from a previous conversation in the old Google Groups forum that gets at exactly what I want to know and was never answered, so I am hoping to re-open the query here, with credit to the original poster of the question, Juan Escobar:
"Dear Nicola,
I have been a LEfSe user for a long time. My colleagues and I recently submitted a paper with some LEfSe results, and one of the reviewers has asked us to correct those results for multiple comparisons. I can calculate q-values with R, but my question is more fundamental: do you think it is necessary to apply FDR correction to the output of LEfSe at all? The algorithm is already very strict in choosing biomarkers: first a Kruskal-Wallis test selects features differentially distributed among classes, next a pairwise Wilcoxon test is applied to the retained features, and finally bootstrapped LDA provides effect-size support. If correction were required, FDR control would then have to be performed at each step. I am aware of a paper you recently published (Rooks et al. 2014, The ISME Journal) in which you performed Benjamini-Hochberg FDR correction on LEfSe p-values. However, I am wondering whether such correction is necessary altogether.
Your thoughts on this topic are welcome since, as far as I have followed publications of LEfSe results, no one applies FDR."
This thread got no reply for over two years; however, I think many people are interested in this topic, as the original LEfSe paper has been cited over 3,000 times, and more and more papers using LEfSe are published every year.
In the section “Subclass structure variants encoding different biological hypotheses” of the original paper, the authors argue that “In both settings, we explicitly require all the pairwise comparison to reject the null hypothesis for detecting the biomarker; thus, no multiple testing corrections are needed.” That is the setting in which pairwise comparisons are performed among subclasses. However, if we have no subclasses and only the per-feature Kruskal-Wallis test is performed, I think we need to correct the p-values somehow.
As the question from the previous thread stated, many of the papers using LEfSe did not apply p-value correction, and the biological interpretation of the results can differ substantially depending on whether correction is applied, so any input from the authors would be helpful.
Hi -
Sorry for the late response. The p-values from LEfSe do require some kind of multiple comparison control if you are not using any other biological substructure. But if you do include any of the multiple-class-consistency or directionality tests (i.e. subclasses), those can often be stricter than standard multiple hypothesis testing (MHT) correction by themselves.
darmecian, you correctly pointed out that p-value correction is needed when only the KW test is performed. By default, LEfSe only outputs significant p-values, whereas multiple comparison correction procedures generally require the full list of p-values. To obtain this, you can set the alpha level in LEfSe to a high value (for example, 1), which will make LEfSe output all p-values. This can be done by passing the flag -a 1 to run_lefse.py.
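For example, here is a minimal sketch of applying BH correction to the resulting table, not official LEfSe tooling. It assumes the tab-separated .res output keeps the p-value in the last column, with "-" for features that received no p-value; adjust the indices (and the hypothetical file name) if your output differs:

```python
# Rough sketch: Benjamini-Hochberg correction of the p-values from a
# LEfSe run made with `-a 1` so that all p-values are reported.
import csv
from statsmodels.stats.multitest import multipletests

rows, pvals = [], []
with open("lefse_output.res") as fh:          # hypothetical file name
    for row in csv.reader(fh, delimiter="\t"):
        if row and row[-1] not in ("", "-"):  # keep rows that have a p-value
            rows.append(row)
            pvals.append(float(row[-1]))

# BH FDR correction over the full list of p-values
reject, qvals, _, _ = multipletests(pvals, alpha=0.05, method="fdr_bh")

for row, q, keep in zip(rows, qvals, reject):
    print(row[0], f"q={q:.4g}", "significant" if keep else "")
```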
Hope this helps!
Siyuan
Thank you very much for the help! I understand what you mean.
One more question: if we use the KW test and also apply an LDA threshold, is it still necessary to perform multiple comparison control?
As you stated in the other thread, if features pass both of those tests, I think they can be considered biological biomarkers even without multiple comparison correction.
Sorry to post a new reply; I could not edit my previous one.
The rationale for not doing multiple comparison control (like BH or Storey) is that, as I understand it, the KW test serves to identify candidates for input into the LDA effect size estimation. In this context, multiple comparison control on the KW test is a bit overly conservative. Additionally, applying multiple comparison control after effect size estimation is somewhat inappropriate, and thresholding the effect size is more appropriate, since the flow of LEfSe is KW -> (Wilcoxon) -> LDA and the last value we obtain is the LDA effect size. A rough sketch comparing the two rules follows below.
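To make the two selection rules concrete, here is a sketch of LEfSe's default filtering (raw p < 0.05 plus the LDA threshold) versus the FDR-controlled version, reusing the .res column layout assumed in the earlier example (feature, log of max class mean, class, log10 LDA score, p-value, with "-" where a value is absent):

```python
# Rough sketch of the two selection rules discussed above, applied to the
# same LEfSe all-p-values output; column indices are assumptions.
import csv
from statsmodels.stats.multitest import multipletests

feats, ldas, pvals = [], [], []
with open("lefse_output.res") as fh:  # hypothetical file name
    for row in csv.reader(fh, delimiter="\t"):
        if row and row[-1] not in ("", "-") and row[3] not in ("", "-"):
            feats.append(row[0])
            ldas.append(abs(float(row[3])))  # |log10 LDA score|
            pvals.append(float(row[-1]))

_, qvals, _, _ = multipletests(pvals, method="fdr_bh")

# Rule 1: raw p-value plus effect-size threshold (LEfSe's usual behavior)
by_raw = {f for f, l, p in zip(feats, ldas, pvals) if p < 0.05 and l >= 2.0}
# Rule 2: the same LDA threshold, but with BH FDR control on the p-values
by_fdr = {f for f, l, q in zip(feats, ldas, qvals) if q < 0.05 and l >= 2.0}

print("raw p + LDA threshold:", len(by_raw))
print("BH q + LDA threshold:", len(by_fdr))
print("dropped by FDR control:", sorted(by_raw - by_fdr))
```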
This might venture a bit into personal preference. I understand the rationale for filtering only on effect size. Multiple testing correction, however, gives theoretical control over false positive rates / false discovery rates, albeit often conservatively. I personally like that, but I'm sure many researchers would have different opinions, probably justifiably so.
I have been running LEfSe on the Galaxy platform with 2 classes (no subclasses). After reading this thread, I decided that I would like to correct for multiple comparisons, since the Wilcoxon test is not performed in the absence of subclasses. Following your suggestion, I set the alpha threshold for LEfSe to 1.0 so that the output would show all p-values.
I have a question about this output compared to the output I obtained before, using the standard cutoff of 0.05. How much would you expect results to vary across iterations of the analysis? Using the standard cutoff I obtained 21 features with p < 0.05 and log LDA > 2.0; using a cutoff of 1.0 I obtained 12 features with p < 0.05 and log LDA > 2.0. I understand that bootstrapping is used to generate the LDA values, so some variation is expected. I just want to confirm that using a cutoff of 1.0 to obtain all p-values does not fundamentally alter the analysis in some other way.
Relatedly, is there a way to set a parameter so that one can obtain the same results twice (equivalent to the set.seed() function in R)? I would like to run LEfSe with a 1.0 cutoff to obtain all p-values for multiple comparison correction, and then run it again with only features below a cutoff of 0.05 deemed significant, for plotting purposes. However, this does not work if the results obtained in each run differ slightly.
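One workaround I can think of, in case a seed parameter does not exist (I could not find one documented, so treat that as an assumption): run LEfSe only once with -a 1 (and, if the -l option for the LDA score threshold is available to you, a permissive value such as -l 0), then derive both the full p-value table and the plotting subset from that single output, so both come from the same bootstrap run. A rough sketch:

```python
# Rough sketch of a run-once workaround: split the single all-p-values
# output into the full table (for FDR correction) and the usual
# p < 0.05, |log LDA| >= 2 subset (for plotting). Same assumed .res
# layout and hypothetical file names as in the sketches above.
import csv

all_rows, plot_rows = [], []
with open("lefse_all.res") as fh:  # hypothetical: the single run's output
    for row in csv.reader(fh, delimiter="\t"):
        if not row or row[-1] in ("", "-") or row[3] in ("", "-"):
            continue
        all_rows.append(row)  # full p-value list for FDR correction
        if float(row[-1]) < 0.05 and abs(float(row[3])) >= 2.0:
            plot_rows.append(row)  # subset matching the usual cutoffs

with open("lefse_plot.res", "w", newline="") as out:
    csv.writer(out, delimiter="\t").writerows(plot_rows)
# lefse_plot.res keeps the .res format, so it should be usable for plotting.
```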