MINLEN question

  1. What does “MINLEN is set to 50 percent of total input read length” mean? Is MINLEN different for each sample?

  2. Different versions have different MINLEN defaults, right? Our version is kneaddata/0.7.2. I think the default MINLEN is 50, right?

Thanks for reaching out to us.

  1. When MINLEN is set to 50 percent of total input read length, it filters out bad reads: if 50% of a read doesn’t reach a quality control (QC) score better than X, the read is simply removed.

  2. With kneaddata/0.7.2, the default is 50. The default has been 50 since the 0.7.0 release.

Let me know if you have further questions and feel free to update this thread.

Hi Sagun,

Thank you for your answer. As for the first question, do you mean that the “quality control score” is the per-base quality, and that a read is removed if the quality of 50% of its bases is less than X?
When I run with the default, the MINLEN value in the log file varies across samples. When I set the MINLEN parameter myself, all samples have the same MINLEN value in the log file. Why?

Thank you for your answer.
Wang

Yes, that is accurate for the first part. MINLEN is computed each time, based on the read length of the sample. You can set MINLEN yourself with the “--trimmomatic-options” flag.

I will need some more details to replicate your issue. Can you tell me how you are setting the MINLEN value?

Thanks,
Sagun

Hi Sagun,
Thank you for your answer. The command I used for paired-end reads is:
kneaddata --input R1.fastq --input R2.fastq --output kneaddata_output --bowtie2 database_folder --trimmomatic-options "SLIDINGWINDOW:4:20 MINLEN:60"
For example, if MINLEN is set to 60, every sample shows MINLEN 60 in the log file. May I ask why?
When I run with the default, the MINLEN value in the log file differs across samples. I think this is correct, since you said “MINLEN is computed each time” for each sample, right?
Thanks.
Wang

Hi user,

I am looking into the issue and trying to reproduce it in the log files. I’ll get back to you shortly. In the meantime, I would really appreciate it if you could share your log file with me here.

Thanks,
Sagun

Hi Wang,

Thank you so much for providing the log files. When the default parameter is used, MINLEN is set to 50% of the read length. The reads in your fastq file are 151 bp long, so the default set MINLEN to 75. Also note that all reads in this fastq file have the same length, i.e. 151 bp.

However, when you pass “trimmomatic_options = SLIDINGWINDOW:4:20 MINLEN:35”, MINLEN is no longer calculated each time but hard-set to 35. This is why the logs show 35. Apologies that it took me some time to confirm this.
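The two cases above can be sketched in a few lines of Python. This is only an illustration of the behavior described in this thread, not KneadData's code: the helper name `effective_minlen` is hypothetical, rounding down is assumed, and what happens when custom options contain no MINLEN step is not covered here, so the sketch returns None in that case.

```python
import re

def effective_minlen(read_length, trimmomatic_options=None, percent=50):
    """Return the MINLEN expected in the log (hypothetical helper)."""
    if trimmomatic_options is None:
        # Default case: a percentage of the read length, rounded down (assumption).
        return read_length * percent // 100
    # Custom options: a MINLEN:<n> step is taken as-is, i.e. hard-set.
    match = re.search(r"\bMINLEN:(\d+)\b", trimmomatic_options)
    return int(match.group(1)) if match else None

print(effective_minlen(151))                                  # default: 50% of 151
print(effective_minlen(151, "SLIDINGWINDOW:4:20 MINLEN:35"))  # hard-set value
```

With 151 bp reads the first call gives 75, and the second gives 35 regardless of read length, which matches the log values discussed above.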

Hi Sagun,
Thank you for your reply.

You are right that the longest read length is 150 bp. So when using the default parameter, MINLEN will be set to 50% of the total read length (question: does the read length refer to the longest read length of each sample? Normally there is a range of read lengths in each sample). After this, the filtering is carried out as described by Trimmomatic, right?

Based on the log file, it makes sense to me that MINLEN is hard-set to the value in “trimmomatic_options” for all samples. I think the default is the best choice in the majority of cases. May I ask a simple question: what is the advantage of the default parameters (calculating MINLEN for each sample) compared with setting MINLEN to one value across all samples?

Thanks for the many exchanges.

Hi Wang,

No, it does not refer to the longest read length of each sample. The read length is calculated every time. However, as you may notice, all the reads in this sample's fastq file have the same length, ~150 bp. This might not be true for other sample files with different extensions, hence the read length is calculated each time.

It depends on your use case and what you want from your results. If you want MINLEN to be 50% of the read length for a given sample, then the default parameters are the best choice (note that the read length in bp does not need to be known in advance, since only a percentage is specified). I agree that the default parameters would give you the best results, as you do not need to hardcode the actual read length, which can vary across samples.

Thanks,
Sagun Maharjan

Hi Sagun,

Thank you for your reply. Based on your reply, my understanding is that MINLEN is calculated for every read in the sample, right? If all reads in the sample have the same length, only one MINLEN value appears in the log file; otherwise, for a sample containing variable read lengths, all the calculated/used MINLEN values appear in the log file, right?

Thanks
Wang

Hi Wang,

Yes, that is my understanding as well. MINLEN is calculated for every read in the sample. Also, if all reads in the sample have the same length, only one MINLEN value appears in the log file; otherwise there will be different MINLEN values for different reads.

Thanks,
Sagun

Hi Sagun,

Thank you very much for your confirmation.

Best,
Wang

Hello,
I was also confused about how KneadData determines the default MINLEN argument when running Trimmomatic.

Just for your information:
I looked into the source code of KneadData to find out how the MINLEN value is calculated.
It turns out that KneadData looks only at the length of the FIRST read in your FASTQ file, takes 50% of that length by default, and passes the result, rounded down, as the MINLEN argument.

For me this was not clear from reading through the above answers and it may be of interest to you if you also have raw reads of variable sequence length.

Assume your first read in your FASTQ file is: “TCGTAAATGAGGTTTCAAAGAGTTGCGTATCTCAGTTGTGGAAAGTTTCCGAT”
It has a length of 53 (without trailing spaces).
50% of this read length is 26.5, which rounds down to 26.
So, in this case MINLEN:26 would be passed to trimmomatic by KneadData right now.
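The calculation above can be sketched in Python as follows. This is a minimal illustration of the behavior as I understand it, not a copy of KneadData's code; the function name `minlen_from_first_read` and the read header `@read1` are made up for the example.

```python
import io

def minlen_from_first_read(fastq_handle, percent=50):
    """Derive MINLEN from the first read only (hypothetical helper)."""
    fastq_handle.readline()                     # line 1: header, starts with '@'
    sequence = fastq_handle.readline().strip()  # line 2: bases of the first read
    # Integer division rounds down, e.g. 26.5 becomes 26.
    return len(sequence) * percent // 100

# A one-record FASTQ file held in memory, using the read from the example above.
example = io.StringIO(
    "@read1\n"
    "TCGTAAATGAGGTTTCAAAGAGTTGCGTATCTCAGTTGTGGAAAGTTTCCGAT\n"
    "+\n"
    + "I" * 53 + "\n"
)
print("MINLEN:%d" % minlen_from_first_read(example))  # MINLEN:26
```

Any reads after the first one are never inspected in this sketch, which is the point of the example.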

I am just a newbie, but I guess this default behavior may cause unexpected results, if by chance the first raw read has an unexpected or outlying length when compared to other reads in the sample.
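To make this concern concrete, here is a toy model in Python. It assumes MINLEN is 50% of the first read's length, rounded down, and that reads shorter than MINLEN are dropped (which is what Trimmomatic's MINLEN step does). The read lengths are invented, and the real pipeline applies MINLEN after other trimming steps, so treat this only as a sketch.

```python
def reads_kept(read_lengths, percent=50):
    """Return (minlen, lengths that survive) for a list of read lengths (toy model)."""
    minlen = read_lengths[0] * percent // 100  # threshold taken from the first read only
    kept = [n for n in read_lengths if n >= minlen]  # shorter reads are dropped
    return minlen, kept

typical_first = [150, 150, 100, 80]  # first read is representative
long_first = [250, 150, 100, 80]     # hypothetical unusually long first read
short_first = [40, 150, 100, 30]     # hypothetical unusually short first read

for lengths in (typical_first, long_first, short_first):
    print(reads_kept(lengths))
```

With a representative first read the threshold is 75 and all four reads pass. With the long first read the threshold rises to 125 and the 100 bp and 80 bp reads are lost. With the short first read it falls to 20 and even the 30 bp read is kept.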

Best regards,
Bernhard

If you are also curious, take a look here: