Hello, I would like clarification about the exact input FASTQ files that KneadData passes into each internal step.
I know the overall workflow is something like:
- (if input is .gz) decompress
- reformat sequence identifiers
- trimming (Trimmomatic)
- decontamination (Bowtie2)
- (optional) repeated remove – second decontamination step
However, I’m trying to understand which exact intermediate files (with filenames) are used as the input to each stage.
Each step writes a file (sometimes temporary) that the next step reads as its input. All file names are built from the input basename + “_kneaddata_”:
- Decompression: Temporary files with the prefix “decompressed_”
- Reformat identifiers: Temporary files with the prefix “reformatted_identifiers”
- Trim: Files with the suffix “.trimmed.fastq”. For paired-end files, “pair1” is used for reads that still have mates after trimming and “orphan1” for reads that do not
- TRF: Files with the suffix “.repeats.removed.fastq”
- Decontam: Contaminant reads go to files named with the database name, e.g. “demo_db_bowtie2_contam”, and the cleaned output ends in just “_kneaddata.fastq”. For paired-end input, “paired_1.fastq” is used for paired files and “unmatched_1.fastq” for orphan files from the previous step.
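If it helps to check this against your own output directory, here is a rough Python sketch that sorts file names into the stages listed above. It only looks for the name fragments mentioned in this reply; it is not part of KneadData, the function names `classify` and `group_outputs` are made up for illustration, and the exact full file names may differ between versions, so treat the result as a guide rather than a guarantee.

```python
from collections import defaultdict
from pathlib import Path

# Name fragments taken from the list above. Only these fragments are
# assumed; the complete file names are not.
STAGE_MARKERS = [
    ("decompress", "decompressed_"),
    ("reformat_identifiers", "reformatted_identifiers"),
    ("trf", ".repeats.removed"),
    ("trim", ".trimmed"),
    ("contaminant", "_contam"),
]


def classify(filename):
    """Return a stage label for one file name (hypothetical helper)."""
    name = Path(filename).name
    for stage, marker in STAGE_MARKERS:
        if marker in name:
            return stage
    # Anything left that carries the kneaddata tag and is a FASTQ file
    # is treated as cleaned (decontaminated) output.
    if "_kneaddata" in name and name.endswith(".fastq"):
        return "decontaminated_output"
    return "unknown"


def group_outputs(out_dir):
    """Group every file in out_dir by the stage that appears to have written it."""
    groups = defaultdict(list)
    for path in sorted(Path(out_dir).iterdir()):
        if path.is_file():
            groups[classify(path.name)].append(path.name)
    return dict(groups)
```

Calling `group_outputs("kneaddata_out")` (the directory name is a placeholder) returns a dictionary from stage label to file names. Temporary files only show up if they were kept, so the decompress and reformat groups may be empty.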