I’m attempting to do a strainphlan analysis of Blautia wexlerae, but nothing I’m doing is working. I think the problem might be the reference genomes that I’m using, but I’m using the 2 primary genomes from NCBI. Is there another place to get correctly formatted references?
Error
Here’s the primary command I was trying, along with the output:
❯ strainphlan -s consenseus_markers/*.pkl -m db_markers/s__Blautia_wexlerae.fna -r ../genomes/Blautia_wexlerae/GCF_025148125.1_ASM2514812v1_genomic.fna ../genomes/Blautia_wexlerae/GCA_025148125.1_ASM2514812v1_genomic.fna -o output -n 16 -c s__Blautia_wexlerae --mutation_rate
Mon Mar 20 10:23:57 2023: Start StrainPhlAn 3.1.0 execution
Mon Mar 20 10:23:57 2023: Creating temporary directory...
Mon Mar 20 10:23:57 2023: Done.
Mon Mar 20 10:23:57 2023: Getting markers from main sample files...
Mon Mar 20 10:23:57 2023: Done.
Mon Mar 20 10:23:57 2023: Getting markers from main reference files...Warning: [blastn] Examining 5 or more matches is recommended
Warning: [blastn] Examining 5 or more matches is recommended
Mon Mar 20 10:23:58 2023: Done.
Mon Mar 20 10:23:58 2023: Removing bad markers / samples...
[e] Phylogeny can not be inferred. Too many samples were discarded
Mon Mar 20 10:23:58 2023: Stop StrainPhlAn 3.0 execution.
Samples
I have 371 stool samples from kids that have B wexlerae according to metaphlan, with the following abundance / read depth distributions (this is all samples, not just those with B. wexlerae):
Things I’ve tried
- I tried downloading ~100 reference genomes from ENA. After fixing the identifier lines to make them play nicely with
makeblastdb
, I tried including them in the-r
argument, and the only difference is that I get ~100 lines ofWarning: [blastn] Examining 5 or more matches is recommended
. - I tried restricting the samples I tested to those with >1%, >2%, >5%, and >10% abundance Same result each time.
- I tried with
--marker_in_n_samples 20 --sample_with_n_markers 20
, same result. - I tried running without reference genomes, sample problem
Earlier commands
❯ sample2markers -i sams/*.sam.bz2 -o consensus_markers/ -n 16
Thu Mar 16 11:56:50 2023: Start samples to markers execution
Thu Mar 16 11:56:50 2023: Decompressing samples...
Thu Mar 16 11:57:17 2023: Done.
Thu Mar 16 11:57:17 2023: Converting samples to BAM format...
Thu Mar 16 11:57:42 2023: Done.
Thu Mar 16 11:57:42 2023: Sorting BAM samples...
Thu Mar 16 11:58:05 2023: Done.
Thu Mar 16 11:58:05 2023: Getting consensus markers from samples...
Thu Mar 16 11:58:05 2023: Processing sample: consensus_markers/tmpfor3b8wt/FG02304_S75.bam
Thu Mar 16 12:02:04 2023: Done.
Thu Mar 16 12:02:04 2023: Processing sample: consensus_markers/tmpfor3b8wt/FG02296_S59.bam
Thu Mar 16 12:07:09 2023: Done.
#... etc
I don’t have the output from extract_markers
, but i ran extract_markers -c s__Blautia_wexlerae -o db_markers/
and it proceeded without errors.
EDIT: Just re-ran extract_markers to see
❯ extract_markers -c s__Blautia_wexlerae -o db_markers/
Mon Mar 20 10:55:34 2023: Start extract markers execution
Mon Mar 20 10:55:34 2023: Generating DB markers FASTA...
Mon Mar 20 10:56:13 2023: Done.
Mon Mar 20 10:56:13 2023: Loading MetaPhlan 3.1.0 databa
se...
Mon Mar 20 10:56:20 2023: Done.
Mon Mar 20 10:56:20 2023: Number of markers for the clad
e "s__Blautia_wexlerae": 150
Mon Mar 20 10:56:20 2023: Exporting markers...
Mon Mar 20 10:56:31 2023: Done.
Mon Mar 20 10:56:32 2023: Finish extract markers execution (57
.27 seconds): Results are stored at "db_markers/
Final command:
❯ strainphlan -s (cat bwex_gt10_pkl.txt) \ # gets samples only > 10% abundance
-m db_markers/s__Blautia_wexlerae.fna \
-r ../genomes/Blautia_wexlerae/GCA_025148125.1_ASM2514812v1_genomic.fna \ # can leave this out, same result
-o output -n 16 -c s__Blautia_wexlerae --mutation_rate \
--marker_in_n_samples 20 --sample_with_n_markers 20
Mon Mar 20 10:57:23 2023: Start StrainPhlAn 3.1.0 execution
Mon Mar 20 10:57:23 2023: Creating temporary directory...
Mon Mar 20 10:57:23 2023: Done.
Mon Mar 20 10:57:23 2023: Getting markers from main sample fil
es...
Mon Mar 20 10:57:24 2023: Done.
Mon Mar 20 10:57:24 2023: Getting markers from main reference
files...Warning: [blastn] Examining 5 or more matches is recom
mended
Mon Mar 20 10:57:25 2023: Done.
Mon Mar 20 10:57:25 2023: Removing bad markers / samples...
[e] Phylogeny can not be inferred. Too many samples were discarded
Mon Mar 20 10:57:25 2023: Stop StrainPhlAn 3.0 execution.