Biobakery_workflows: Broken hand-off of fastq compression handling between Kneaddata and MetaPhlan

Although the fastq.gz files are accepted by the biobakery_workflows initially, there’s a breakdown in the handoff between Kneaddata and MetaPhlan that produces the error:

Error, input data has to be in fastq or fasta format

I’ve tried gunzipping the files, deleting the output directory with all its contents, updating the command to include the flags --input-extension fastq and --qc-options="--input_type fastq", and re-running, but I get the same error (“Error, input data has to be in fastq or fasta format”).

Notably, the log output shows the file names going from having no extension in the hands of Kneaddata to having a “.gz” extension in the hands of MetaPhlan. My guess is that the extension is being stripped somewhere in the code by locating a period, and that is what causes the break.
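To illustrate the kind of bug I have in mind (I have not checked the workflow source, so this is only a sketch and not the actual biobakery_workflows code): stripping an extension at the last period removes just one suffix from a doubly-extended name like .fastq.gz. The function names and the sample file name below are hypothetical.

```python
import os

# Hypothetical suffix groups, chosen only for this illustration.
COMPRESSION_SUFFIXES = (".gz", ".bz2")
FORMAT_SUFFIXES = (".fastq", ".fq", ".fasta", ".fa")


def strip_last_suffix(filename):
    """Naive approach: drop only whatever follows the last period."""
    return os.path.splitext(filename)[0]


def strip_read_suffixes(filename):
    """Drop a compression suffix first, then a sequence-format suffix."""
    base = filename
    for group in (COMPRESSION_SUFFIXES, FORMAT_SUFFIXES):
        root, ext = os.path.splitext(base)
        if ext.lower() in group:
            base = root
    return base


name = "sample_R1_001.fastq.gz"  # hypothetical file name
print(strip_last_suffix(name))    # one suffix is left behind
print(strip_read_suffixes(name))  # both suffixes are removed
```

If one step of a pipeline uses the first approach and another uses the second, the two steps end up disagreeing about what the sample is called, which would look a lot like the mismatch in the log.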

Alternatively, the only other thing I can think of is that, for some reason, KneadData creates another file in the output directory with the same name as the original fastq.gz file (but a fraction of the size), and maybe MetaPhlan is picking up that copy, expecting it to have a fastq extension, and failing because it is still compressed. I’m not sure why that copy is made in the first place, how it differs from the original, or why it would be used in place of the original.

Log:

(Feb 04 13:19:50) [ 0/52 -   0.00%] **Ready    ** Task  3: kneaddata____20091FL-05-01-75_S143_L007_R1_001
(Feb 04 13:19:50) [ 0/52 -   0.00%] **Started  ** Task  3: kneaddata____20091FL-05-01-75_S143_L007_R1_001
(Feb 04 13:39:39) [ 1/52 -   1.92%] **Completed** Task  3: kneaddata____20091FL-05-01-75_S143_L007_R1_001
(Feb 04 13:39:39) [ 1/52 -   1.92%] **Ready    ** Task  8: metaphlan____20091FL-05-01-75_S143_L007_R1_001.gz
(Feb 04 13:39:39) [ 1/52 -   1.92%] **Started  ** Task  8: metaphlan____20091FL-05-01-75_S143_L007_R1_001.gz
(Feb 04 13:40:12) [ 2/52 -   3.85%] **Failed   ** Task  8: metaphlan____20091FL-05-01-75_S143_L007_R1_001.gz

Original command run:

(biobakery_workflows) [4/02/25 2:00:45] ➜  ~ biobakery_workflows wmgx --input ./test --output ./test/biobakery_wf_output --pair-identifier _R1_ --qc-options="--trf /Users/nyb/miniforge3/bin --trimmomatic /Users/nyb/Trimmomatic-0.39 --threads 8 --processes 2 --reference-db ./Projects/workflow/resources/databases/mouse_refdb --reference-db ./Projects/workflow/resources/databases/human_refdb" --contaminate-databases /Users/nyb/biobakery_workflows_databases

I think I solved my own problem, because I’ve moved on to a new error.

The issue seemed to be that if you are using a custom pair identifier, you need to list everything after the sample name and before the period where the extension starts.

So in my case, instead of adding this flag to the command:

--pair-identifier _R1_

It probably needed to be:

--pair-identifier _R1_001
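For anyone who wants to work out the identifier for their own files, here is a small sketch of the rule as I understand it: take everything from the read marker up to the first period that follows it. The helper name is made up, and the file name is my guess at the full name of the sample in the log above, assuming a .fastq.gz extension. This is not part of biobakery_workflows.

```python
def suggest_pair_identifier(filename, read_marker="_R1"):
    """Return the text from the read marker up to the next period.

    Hypothetical helper for illustration only.
    """
    start = filename.rfind(read_marker)
    if start == -1:
        raise ValueError(f"{read_marker!r} not found in {filename!r}")
    end = filename.find(".", start)
    if end == -1:
        # No extension at all: use the rest of the name.
        end = len(filename)
    return filename[start:end]


# Assumed full file name for the sample shown in the log.
name = "20091FL-05-01-75_S143_L007_R1_001.fastq.gz"
print(suggest_pair_identifier(name))  # prints _R1_001
```

The printed value is what I would pass to --pair-identifier instead of the shorter _R1_.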

That being said, since I’m going on multiple weeks of trying to get past one error after another with no guidance, for the time being I simply relabeled my files so they don’t need a custom identifier, and I’m going to keep troubleshooting. I figured I would update here in case others run into this issue, come to the forum for help, and can benefit from what I suspect is the solution.