Thanks for your question; it seems this has come up for some other people as well.
As you mentioned in our correspondence, the issue arises when there are more than two classes. Even though there is no subclass variable, LEfSe still performs pairwise Wilcoxon tests among the classes in a multiclass grouping variable.
From the original paper:
"Our multiclass strategy for the Wilcoxon test depends on the problem-specific strategy chosen by the user to define features differentially distributed among the n classes. In the most stringent strategy, we require that all n abundance profiles of a feature are statistically significantly distinct among all n classes. This strategy, called ‘strict’, is implemented by requiring that all Wilcoxon tests between classes are significant. A more permissive strategy, called ‘non-strict’, considers a feature as a biomarker if at least one class is significantly different from all others. The more permissive strategy thus needs to satisfy only a subset of the Wilcoxon tests. Regardless of the strategy, the LDA step always reports the highest score detected among all pairwise class comparisons."
So, when you use the strict “all-against-all” comparison, the global Kruskal-Wallis (KW) test is run first; in your case it identified 85 features with significant variation across your four groups. Pairwise Wilcoxon tests are then performed between each pair of groups. In the all-against-all setting with alpha set to 1, ALL pairwise tests must have p < 1.0 for the feature to pass the all-against-all criterion. When two or more groups do not contain the feature at all, the p-value for at least one of the pairwise Wilcoxon tests is EQUAL to 1 (and therefore not LESS than 1), so that feature is not counted among the discriminative features (54 in your case).
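To make the strict comparison concrete, here is a minimal Python sketch of the all-against-all rule as described above. It is not LEfSe's actual code: the function name `passes_all_against_all`, the group labels, and the p-values are all made up for illustration.

```python
from itertools import combinations

def passes_all_against_all(pairwise_p, alpha):
    """Strict rule: every pairwise p-value must be strictly below alpha."""
    return all(p < alpha for p in pairwise_p.values())

groups = ["A", "B", "C", "D"]          # hypothetical class labels
pairs = list(combinations(groups, 2))  # 6 pairwise comparisons for 4 classes

# Hypothetical p-values for a feature that is absent from groups C and D:
# the C-vs-D comparison has nothing to separate, so its p-value is 1.
absent_in_two = {pair: 0.01 for pair in pairs}
absent_in_two[("C", "D")] = 1.0

# Hypothetical p-values for a feature present in all four groups.
present_everywhere = {pair: 0.2 for pair in pairs}

print(passes_all_against_all(absent_in_two, alpha=1.0))       # False, since 1.0 < 1.0 fails
print(passes_all_against_all(present_everywhere, alpha=1.0))  # True
```

The point of the sketch is the strict inequality: with alpha set to 1, a single pairwise p-value of exactly 1 is enough to drop the feature.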
One way to deal with this is to use the one-against-all option for multiclass analysis. Per the ‘non-strict’ strategy quoted above, it keeps a feature if at least one class is significantly different from all the others, so only a subset of the pairwise Wilcoxon tests needs to be significant. In that case, you should keep the alpha for the Wilcoxon test at a meaningful level (0.05, for instance) so that you only pick features that are significantly different between classes within the multiclass grouping variable.
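For comparison, here is a small Python sketch of the more permissive rule, following the ‘non-strict’ definition quoted from the paper (at least one class significantly different from all others). Again, this is only an illustration and not LEfSe's implementation; `passes_one_against_all`, the group labels, and the p-values are hypothetical.

```python
from itertools import combinations

def passes_one_against_all(pairwise_p, groups, alpha):
    """Non-strict rule: some class differs significantly from every other class."""
    def p_value(a, b):
        # Each pair is stored once, so look it up in either order.
        return pairwise_p[(a, b)] if (a, b) in pairwise_p else pairwise_p[(b, a)]

    return any(
        all(p_value(g, other) < alpha for other in groups if other != g)
        for g in groups
    )

groups = ["A", "B", "C", "D"]  # hypothetical class labels

# Hypothetical feature found only in group A: A differs from B, C and D,
# while B, C and D cannot be told apart (p = 1 for those comparisons).
pairwise_p = {pair: 1.0 for pair in combinations(groups, 2)}
for other in ["B", "C", "D"]:
    pairwise_p[("A", other)] = 0.01

print(passes_one_against_all(pairwise_p, groups, alpha=0.05))  # True

# The same feature fails the strict rule, which needs every pair below alpha.
strict = all(p < 0.05 for p in pairwise_p.values())
print(strict)  # False
```

With a meaningful alpha such as 0.05, a feature present in only one group is kept under this rule even though the comparisons among the remaining groups all have p = 1.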
I hope this helps, and let me know if you have any further questions!