Format_input.py parameters: how do they affect the pipeline and the plots?

Hello everyone,

I am trying to analyse some data with LEfSe on anaconda (python).
I successfully executed the script with the example data.

Is the cladogram supposed to look like this? If so, how can I change the colors for the biomarkers? If not, I believe it is because no biomarker clade was found: I could not generate the other picture either (step 3, see below).

The code I run is the following:

bin/format_input.py tmp/sample.txt tmp/merged_abundance_table.lefse
bin/run_lefse.py tmp/merged_abundance_table.lefse tmp/merged_abundance_table.lefse.out -l 4
bin/plot_res.py --dpi 300 tmp/merged_abundance_table.lefse.out output_images/lefse_biomarkers.png
bin/plot_res.py --dpi 300 tmp/merged_abundance_table.lefse.out tmp/lefse_biomarkers.png

I am not sure about the format_input.py parameters (-c, -s, -u, -o). What do they do? I could not find any information online, and the code where they are used is not commented, so before trying to reverse-engineer everything I thought I would ask.

Cheers,

Pietro

Hi Pietro,

Thanks for the question! The cladogram you produced does look correct if there were no significant features to plot on it.

As for the parameters:
-c = row of the data to use as the class
-s = row that contains the subclass information
-u = the row with the subject information
-o = the normalization value (for LEfSe the default is -1.0, meaning no normalization)
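
To make the row numbers concrete, here is a small Python sketch. It is not LEfSe's own code: the table contents and the `metadata_row` helper are made up for illustration, and it assumes that features are on rows, that the first column holds the row name, and that row numbers are 1-based.

```python
# Hypothetical tab-separated input: metadata rows first, then feature rows.
TABLE = """\
disease\tcase\tcase\tcontrol\tcontrol
sex\tF\tM\tF\tM
subject_id\ts1\ts2\ts3\ts4
Bacteria|Firmicutes\t10\t20\t30\t40
Bacteria|Bacteroidetes\t90\t80\t70\t60
"""


def metadata_row(table_text, row_number):
    """Return (row_name, values) for a 1-based row number (assumed convention)."""
    rows = [line.split("\t") for line in table_text.strip().splitlines()]
    name, *values = rows[row_number - 1]
    return name, values


# With this layout, "-c 1 -s 2 -u 3" would point at these three rows.
class_row = metadata_row(TABLE, 1)
subclass_row = metadata_row(TABLE, 2)
subject_row = metadata_row(TABLE, 3)
print(class_row, subclass_row, subject_row)
```

In this made-up table the class is the disease status, the subclass is sex, and the subject row carries the sample identifiers; your own table may order these rows differently, which is exactly what the three flags let you specify.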

I have attached the help page for the format LEfSe step. I hope this helps. Let us know if we can do anything else.

Best,
Kelsey

[Attachment: screenshot of the format_input.py help page]

Hi,

I just started using this software. In this post, sma says the normalization is done internally to get relative abundances. Does that mean the program still does it automatically if I don’t set the -o parameter, or do I have to set it? One more question: in the tutorial, -o is set to 1000000, and the Galaxy interface explains it a little ("Per-sample normalization of the sum of the values to 1M"), but I am doing amplicon research and I am not sure what that normalization actually does to my samples.

Thank you!

Hi -

I’m not sure what the default normalization factor is, so it’s probably safe to set your own.

Fortunately, I think that because of how logarithms behave, LEfSe results shouldn’t be sensitive to the choice of -o. The intuition is that log(a * 1000) - log(b * 1000) = log(a * 1e6) - log(b * 1e6), since the scaling factor cancels out of the difference, so normalizing to 1M should give you the same p-values as normalizing to 1000. There will be differences in how you interpret the effect sizes, though (for example, it might be easier to think of things on the CPM scale, hence the 1M used in the tutorial). I’m not sure what a good choice is for amplicon data either.
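
As a rough illustration of that intuition, here is a minimal Python sketch. It follows the Galaxy description quoted above ("Per-sample normalization of the sum of the values to 1M") rather than LEfSe's actual source, and the counts and function names are made up.

```python
import math


def normalize_sample(counts, target):
    """Scale one sample's counts so that they sum to `target`."""
    total = sum(counts)
    return [c * target / total for c in counts]


# Hypothetical read counts for the same three features in two samples.
sample_a = [120, 30, 50]     # sums to 200
sample_b = [300, 600, 100]   # sums to 1000


def log_diffs(target):
    """Per-feature log10 difference between the two normalized samples."""
    a = normalize_sample(sample_a, target)
    b = normalize_sample(sample_b, target)
    return [math.log10(x) - math.log10(y) for x, y in zip(a, b)]


diffs_1k = log_diffs(1_000)
diffs_1m = log_diffs(1_000_000)

# The normalization target cancels out of every per-feature difference.
print(all(math.isclose(p, q) for p, q in zip(diffs_1k, diffs_1m)))
```

Each sample is first turned into proportions of its own total and then multiplied by the target, so samples with different sequencing depths become comparable; the target only sets the scale on which the values are reported.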

The bottom line is, I’d suggest you try a few different -o values that make sense to you and pick the one that’s easiest to interpret. I’d also check whether they give very different p-values. If they do, I’d be worried, so please let us know.

Thanks,
Siyuan