Hi
When I use KneadData (v0.12.0) to trim sequence files and remove host (mouse) reads, very few reads are left at the end of the pipeline. I am using the default parameters.
Command:
kneaddata --input1 sample1.fq --input2 sample2.fq -db GRCm39_mice_db --output /output/ -t 20 --run-fastqc-start --run-fastqc-end --trimmomatic /home/software/trimmomatic/Trimmomatic-0.39 --sequencer-source none
An excerpt from the log file for one sample:
11/30/2022 05:52:16 PM - kneaddata.utilities - INFO: Running bowtie2 ...
11/30/2022 05:52:16 PM - kneaddata.utilities - INFO: Execute command: kneaddata_bowtie2_discordant_pairs --bowtie2 /home/lzhang/miniconda3/envs/mice_project/bin/bowtie2 --threads 20 -x /sbidata/projects/lzhang/2022_mice/Analysis/kneaddata_database/GRCm39/GRCm39_mice_db --mode strict --bowtie2-options "--very-sensitive-local --phred33" -1 /sbidata/projects/lzhang/2022_mice/Data/rawData_part2/processed/D14_4F8_1.new_kneaddata.repeats.removed.1.fastq -2 /sbidata/projects/lzhang/2022_mice/Data/rawData_part2/processed/D14_4F8_1.new_kneaddata.repeats.removed.2.fastq --un-pair /sbidata/projects/lzhang/2022_mice/Data/rawData_part2/processed/D14_4F8_1.new_kneaddata_GRCm39_mice_db_bowtie2_paired_clean_%.fastq --al-pair /sbidata/projects/lzhang/2022_mice/Data/rawData_part2/processed/D14_4F8_1.new_kneaddata_GRCm39_mice_db_bowtie2_paired_contam_%.fastq -U /sbidata/projects/lzhang/2022_mice/Data/rawData_part2/processed/D14_4F8_1.new_kneaddata.repeats.removed.unmatched.1.fastq,/sbidata/projects/lzhang/2022_mice/Data/rawData_part2/processed/D14_4F8_1.new_kneaddata.repeats.removed.unmatched.2.fastq --un-single /sbidata/projects/lzhang/2022_mice/Data/rawData_part2/processed/D14_4F8_1.new_kneaddata_GRCm39_mice_db_bowtie2_unmatched_%_clean.fastq --al-single /sbidata/projects/lzhang/2022_mice/Data/rawData_part2/processed/D14_4F8_1.new_kneaddata_GRCm39_mice_db_bowtie2_unmatched_%_contam.fastq -S /dev/null
11/30/2022 06:01:17 PM - kneaddata.utilities - DEBUG: b'6279552 reads; of these:\n 6279552 (100.00%) were unpaired; of these:\n 6541 (0.10%) aligned 0 times\n 930147 (14.81%) aligned exactly 1 time\n 5342864 (85.08%) aligned >1 times\n99.90% overall alignment rate\npair1_aligned : 2691082\npair2_aligned : 2691082\npair1_unaligned : 2140\npair2_unaligned : 2140\norphan1_aligned : 434424\norphan2_aligned : 457777\norphan1_unaligned : 406\norphan2_unaligned : 501\n'
11/30/2022 06:01:17 PM - kneaddata.utilities - DEBUG: Checking output file from bowtie2 : /sbidata/projects/lzhang/2022_mice/Data/rawData_part2/processed/D14_4F8_1.new_kneaddata_GRCm39_mice_db_bowtie2_paired_clean_1.fastq
11/30/2022 06:01:17 PM - kneaddata.utilities - DEBUG: Checking output file from bowtie2 : /sbidata/projects/lzhang/2022_mice/Data/rawData_part2/processed/D14_4F8_1.new_kneaddata_GRCm39_mice_db_bowtie2_paired_clean_2.fastq
11/30/2022 06:01:18 PM - kneaddata.run - INFO: Total contaminate sequences in file ( /sbidata/projects/lzhang/2022_mice/Data/rawData_part2/processed/D14_4F8_1.new_kneaddata_GRCm39_mice_db_bowtie2_paired_contam_1.fastq ) : 2691082.0
11/30/2022 06:01:20 PM - kneaddata.run - INFO: Total contaminate sequences in file ( /sbidata/projects/lzhang/2022_mice/Data/rawData_part2/processed/D14_4F8_1.new_kneaddata_GRCm39_mice_db_bowtie2_paired_contam_2.fastq ) : 2691082.0
11/30/2022 06:01:20 PM - kneaddata.run - INFO: Total contaminate sequences in file ( /sbidata/projects/lzhang/2022_mice/Data/rawData_part2/processed/D14_4F8_1.new_kneaddata_GRCm39_mice_db_bowtie2_unmatched_1_contam.fastq ) : 434424.0
11/30/2022 06:01:20 PM - kneaddata.run - INFO: Total contaminate sequences in file ( /sbidata/projects/lzhang/2022_mice/Data/rawData_part2/processed/D14_4F8_1.new_kneaddata_GRCm39_mice_db_bowtie2_unmatched_2_contam.fastq ) : 457777.0
11/30/2022 06:01:20 PM - kneaddata.utilities - INFO: READ COUNT: decontaminated GRCm39_mice_db pair1 : Total reads after removing those found in reference database ( /sbidata/projects/lzhang/2022_mice/Data/rawData_part2/processed/D14_4F8_1.new_kneaddata_GRCm39_mice_db_bowtie2_paired_clean_1.fastq ): 2140.0
11/30/2022 06:01:20 PM - kneaddata.utilities - INFO: READ COUNT: decontaminated GRCm39_mice_db pair2 : Total reads after removing those found in reference database ( /sbidata/projects/lzhang/2022_mice/Data/rawData_part2/processed/D14_4F8_1.new_kneaddata_GRCm39_mice_db_bowtie2_paired_clean_2.fastq ): 2140.0
11/30/2022 06:01:20 PM - kneaddata.utilities - INFO: READ COUNT: final pair1 : Total reads after merging results from multiple databases ( /sbidata/projects/lzhang/2022_mice/Data/rawData_part2/processed/D14_4F8_1.new_kneaddata_paired_1.fastq ): 2140.0
11/30/2022 06:01:20 PM - kneaddata.utilities - INFO: READ COUNT: final pair2 : Total reads after merging results from multiple databases ( /sbidata/projects/lzhang/2022_mice/Data/rawData_part2/processed/D14_4F8_1.new_kneaddata_paired_2.fastq ): 2140.0
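To make the bowtie2 summary in that log easier to read, here is a minimal Python sketch that tallies the aligned and unaligned counts. The numbers are copied from the DEBUG line above; the variable names are my own and are not part of KneadData.

```python
import re

# Per-category counts copied from the bowtie2 DEBUG line in the log above.
summary = (
    "pair1_aligned : 2691082\npair2_aligned : 2691082\n"
    "pair1_unaligned : 2140\npair2_unaligned : 2140\n"
    "orphan1_aligned : 434424\norphan2_aligned : 457777\n"
    "orphan1_unaligned : 406\norphan2_unaligned : 501\n"
)
total_reads = 6279552  # "6279552 reads; of these:" in the same log line

# Parse "name : count" pairs into a dictionary.
counts = {name: int(value) for name, value in re.findall(r"(\w+) : (\d+)", summary)}

# Reads flagged as host (aligned) versus reads kept as clean (unaligned).
aligned = sum(v for k, v in counts.items() if k.endswith("_aligned"))
unaligned = sum(v for k, v in counts.items() if k.endswith("_unaligned"))

print(f"aligned:   {aligned}")
print(f"unaligned: {unaligned}")
print(f"accounted for: {aligned + unaligned} of {total_reads}")
print(f"fraction kept: {unaligned / total_reads:.4%}")
```

The eight categories add up to the 6279552 input reads, so no reads are lost between bowtie2 and the wrapper. The number kept (5187) is lower than the 6541 reads that bowtie2 reports as "aligned 0 times". That would be consistent with `--mode strict` discarding a read when its mate aligns, but I am assuming that is how the wrapper counts.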
Read counts at each step of the pipeline:
Sample | raw pair1 | raw pair2 | trimmed pair1 | trimmed pair2 | trimmed orphan1 | trimmed orphan2 | decontaminated GRCm39_mice_db pair1 | decontaminated GRCm39_mice_db pair2 | decontaminated GRCm39_mice_db orphan1 | decontaminated GRCm39_mice_db orphan2 | final pair1 | final pair2 | final orphan1 | final orphan2 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
my_sample | 3872223 | 3872223 | 3489545 | 3489545 | 164393 | 157032 | 2140 | 2140 | 406 | 501 | 2140 | 2140 | 406 | 501 |
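As a rough illustration of how much is lost at each step, this short Python sketch computes the fraction of raw pair1 reads still present after trimming and after decontamination. The counts are copied from the table row; the step labels and variable names are only for illustration.

```python
# Read counts copied from the table row for my_sample (pair1 columns only).
steps = [
    ("raw pair1", 3872223),
    ("trimmed pair1", 3489545),
    ("decontaminated pair1", 2140),
    ("final pair1", 2140),
]

raw = steps[0][1]

# Fraction of the raw pair1 reads that remain after each step.
retention = {name: count / raw for name, count in steps}

for name, frac in retention.items():
    print(f"{name:>22}: {frac:8.4%}")
```

Trimming keeps about 90% of the raw reads, so almost all of the loss happens in the bowtie2 decontamination step.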
Some samples give reasonable results, but others have very few reads left.
I am not sure whether these samples are simply heavily contaminated with host reads or whether something went wrong in the bowtie2 step.
Thanks a lot!