Scripting with strainphlan.py

Hi,

I am trying to run a script with a loop to using a number of extracted markers (located in “db_markers”) over my sample set (.pkl files) in “consensus_markers” and output results into “output”.

For a particular SGB e.g. 9228, I am able to run the following with no issue:

strainphlan -s consensus_markers/*.pkl
-m db_markers/t__SGB9228.fna
-o output/SGB9228
-n 8
-c t__SGB9228
–mutation_rates

When I then try to run this in a script:

#!/bin/bash

Define paths

CONS_MARKERS_DIR=“consensus_markers”
DB_MARKERS_DIR=“db_markers”
OUTPUT_BASE=“output”

Loop through each .fna file in the db_markers folder

for REFERENCE_FILE in ${DB_MARKERS_DIR}/t__*.fna; do
# Extract the reference group name from the filename
REFERENCE_GROUP=“$(basename “${REFERENCE_FILE}” .fna)”

# Construct the output directory path
OUTPUT_DIR="${OUTPUT_BASE}/${REFERENCE_GROUP}" 

# Run the strainphlan command
strainphlan -s ${CONS_MARKERS_DIR}/*.pkl \
             -m "$REFERENCE_FILE" \
             -o "$OUTPUT_DIR" \
             -n 8 \
             -c "$REFERENCE_GROUP" \
             --mutation_rates

done

I get the following error…

Processing file: db_markers/t__SGB9228.fna
Reference group: db_markers/t__SGB9228
Output directory: output/db_markers/t__SGB9228
Running strainphlan command…
Mon Aug 28 14:56:53 2023: Start StrainPhlAn 4.0.6 execution
Mon Aug 28 14:56:53 2023: Creating temporary directory…
Mon Aug 28 14:56:53 2023: Done.
Mon Aug 28 14:56:53 2023: Filtering markers and samples…
Mon Aug 28 14:56:53 2023: Getting markers from main samples…
Mon Aug 28 14:56:55 2023: Done.
Mon Aug 28 14:56:55 2023: Getting markers from main references…
Mon Aug 28 14:56:55 2023: Done.
Mon Aug 28 14:56:55 2023: Removing bad markers / samples…
Mon Aug 28 14:56:55 2023: Done.
Mon Aug 28 14:56:55 2023: Getting markers from secondary samples and references…
Mon Aug 28 14:56:55 2023: Done.
Mon Aug 28 14:56:55 2023: Done.
Mon Aug 28 14:56:55 2023: Writing samples as markers’ FASTA files…
Mon Aug 28 14:56:55 2023: [Error] An error ocurred when creating the “[Errno 2] No such file or directory: ‘output/db_markers/t__SGB9228/tmpehrlwlp_/db_markers/t__SGB9228.StrainPhlAn4’” folder
Mon Aug 28 14:56:55 2023: Stop StrainPhlAn execution.
Done processing file: db_markers/t__SGB9228.fna

It looks like it is running into an issue with where the temporary files are being created, but despite (many) modifications to the script (all driven by AI) I can’t work out how to get around this, and think it may require a modification of the strainphlan.py script?

Already way beyond my level of expertise, and so would be grateful for any advice as to how to overcome this issue.

Thanks in advance.

for f in db_markers/t__*.fna; do
echo “Running StrainPhlAn on ${f}”
bn=$(basename ${f})
id=${bn%.fna}
mkdir -p “output/${id}”
strainphlan -s consensus_markers/**.pkl
-m “$f”
-o “output/${id}”
-n 8
-c “$id”
–mutation_rates
done

Hi @blair.merrick , you can try to run strainphlan without the -m parameter. If don’t specify, StrainPhlAn will automatically run extract_markers.py to get the clade markers for the specified -c parameter

Thanks Aitor.
Sorry only half of my follow-up message seems to have posted.
Managed to get it work, so all good.
Best wishes,
Blair