I hope you are doing well. I am reaching out about what I think is a serious problem with LEfSe. We have just realized that modifying the names of features (e.g. genes or species) changes the results of the analysis. For example, in one dataset I am currently working on, replacing “sp._oral_taxon” with “HOT” to shorten the names results in 21 differentially abundant species instead of 22, and the LDA scores also change. When I replaced the species names with just numbers, I got 16 differentially abundant taxa instead. I have suspended submission of a manuscript until this is resolved.
I am facing the same problem with both the Galaxy and command line versions. Please find attached 3 versions of the input file I have used recently:
1- With full species names
2- With abbreviated species names
3- With species names replaced by numbers
I obtain 22, 21, and 16 differentially abundant taxa with these files, respectively, at an LDA score of >= 2.5.
JIA_species_level_names_changed.txt (163.6 KB) JIA_species_level_names_replaced_by_numbers.txt (159.4 KB) JIA_species_level.txt (164.3 KB)
I’ll have a look at this as soon as possible. This seems a very serious issue.
It seems that some of the species names have dots and pipes inside the name.
LEfSe interprets the pipe character as a separator between taxonomic levels. In the results of run_lefse.py you can see that Neisseria flavescens|subflava has been split into two species, which adds a new feature to the testing (see line 276 of format_input.py for reference).
This does not happen when the species names are replaced with numbers.
The same hierarchy-building behavior can also occur when dots are present in the name (e.g. Fusobacterium_sp._HOT_204); see line 196 of format_input.py for reference.
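To illustrate the mechanism, here is a minimal Python sketch. It is not the code from format_input.py; the function name expand_hierarchy and its sep parameter are made up for this example. It only shows how treating a character as a level separator turns one name into several features, which cannot happen with purely numeric names.

```python
def expand_hierarchy(names, sep="|"):
    """Return every feature implied by the names, ancestor levels included.

    Hypothetical helper for illustration only; LEfSe's own logic may differ.
    """
    features = []
    seen = set()
    for name in names:
        parts = name.split(sep)
        # Each prefix of the split name counts as its own clade.
        for depth in range(1, len(parts) + 1):
            clade = sep.join(parts[:depth])
            if clade not in seen:
                seen.add(clade)
                features.append(clade)
    return features


named = ["Neisseria flavescens|subflava", "Fusobacterium_sp._HOT_204"]
numbered = ["1", "2"]

print(len(expand_hierarchy(named)), expand_hierarchy(named))
print(len(expand_hierarchy(numbered)), expand_hierarchy(numbered))
```

Two named inputs yield three features here, while the two numbered inputs stay at two, so the number of tests (and therefore the results) can differ between the files.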
Thank you for taking the time to look into this, but I think the problem is independent of the dots and pipes. The only difference between the two files with names is that “oral_taxon” was changed to “HOT”, so the dots and pipes remain the same. Also, in my experience LEfSe does not split by dots; it just replaces them with underscores. BTW, I ran all three files through MaAsLin and the results were identical.
Let me share with you a simpler example. Attached are two input files for genus-level data. One has names with no symbols (not even underscores). In the other file, the names are replaced by numbers. The numbers of differentially abundant features identified are 4 and 7 for the two files, respectively.
I believe the software has a bug.
P.S.: I am not able to attach the files (it says new users cannot upload files), so I have sent them to Sagun.
JIA_genus_level_names changed.txt (50.8 KB) JIA_genus_level_names_replaced_by-numbers.txt (50.2 KB)
It seems I found out why I had different results with the two species-level input files with names (JIA_species_level.txt and JIA_species_level_names_changed.txt). One species is listed twice in the original file as follows:
So the only difference is in the “O” (capital vs. small), and the software treats them as two different species. However, with the renaming done in the other file (oral_taxon replaced by HOT), both names become identical and the software then merges them together!
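For anyone who wants to check their own input for this kind of collision before running LEfSe, here is a rough Python sketch. The two feature names are made-up placeholders, not the actual species from my file, and I am assuming the find-and-replace was case-insensitive, which is what would make the two names identical.

```python
import re
from collections import defaultdict


def find_collisions(names, transform):
    """Group original names by their transformed form and keep groups of 2+."""
    groups = defaultdict(list)
    for name in names:
        groups[transform(name)].append(name)
    return {new: old for new, old in groups.items() if len(old) > 1}


def shorten(name):
    # Assumed rename step: case-insensitive "oral_taxon" -> "HOT".
    return re.sub(r"oral_taxon", "HOT", name, flags=re.IGNORECASE)


# Placeholder names that differ only in the case of the "O".
names = [
    "Genus_sp._Oral_taxon_000",
    "Genus_sp._oral_taxon_000",
    "Othergenus_sp._oral_taxon_001",
]

collisions = find_collisions(names, shorten)
print(collisions)
```

Any entry in the returned dictionary is a renamed feature that now stands for more than one original row, which is the situation that changed my count from 22 to 21.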
This however only explains the differences in results obtained with these two files, but of course does not explain why we get totally different results when species names are replaced by numbers. I hope you continue working on it.
If this is still an issue, could you share the command you used on the two provided input files? Running the following command does seem to generate identical results, albeit ordered differently.
format_input.py JIA_genus_level_names\ changed.txt JIA_name.in -u 2
run_lefse.py JIA_name.in JIA_name.out
format_input.py JIA_genus_level_names_replaced_by-numbers.txt JIA_number.in -u 2
run_lefse.py JIA_number.in JIA_number.out
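Since the two outputs are ordered differently and the feature names differ, a quick order-independent comparison can help. Below is a minimal Python sketch, assuming the output is tab-separated with the class in the third column and the LDA score in the fourth, and that rows without a score leave that field empty or set to "-". The rows in the example are invented for illustration, not taken from the JIA files, and significant_rows is a made-up helper name.

```python
from collections import Counter


def significant_rows(lines, min_lda=2.0):
    """Count (class, LDA score) pairs for rows with an LDA score >= min_lda.

    Feature names are ignored so that renamed inputs can be compared.
    The column positions are an assumption about the output layout.
    """
    found = Counter()
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 4:
            continue
        try:
            lda = float(fields[3])
        except ValueError:
            continue  # empty or "-" means no score was reported
        if lda >= min_lda:
            found[(fields[2], round(lda, 6))] += 1
    return found


# Invented rows: same results, different names and order.
by_name = [
    "GenusA\t4.1\tcase\t3.20\t0.01\n",
    "GenusB\t3.0\t\t\t-\n",
    "GenusC\t3.9\tcontrol\t2.75\t0.03\n",
]
by_number = [
    "3\t3.9\tcontrol\t2.75\t0.03\n",
    "1\t4.1\tcase\t3.20\t0.01\n",
    "2\t3.0\t\t\t-\n",
]

print(significant_rows(by_name) == significant_rows(by_number))
```

To apply this to real files, pass an open file handle for each output (for example JIA_name.out and JIA_number.out) in place of the lists, and set min_lda to the threshold you are using.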