Dealing with "Too many samples were discarded"

I’m attempting to do a strainphlan analysis of Blautia wexlerae, but nothing I’m doing is working. I think the problem might be the reference genomes that I’m using, but I’m using the 2 primary genomes from NCBI. Is there another place to get correctly formatted references?

Error

Here’s the primary command I was trying, along with the output:

❯ strainphlan -s consenseus_markers/*.pkl -m db_markers/s__Blautia_wexlerae.fna -r ../genomes/Blautia_wexlerae/GCF_025148125.1_ASM2514812v1_genomic.fna ../genomes/Blautia_wexlerae/GCA_025148125.1_ASM2514812v1_genomic.fna -o output -n 16 -c s__Blautia_wexlerae --mutation_rate
Mon Mar 20 10:23:57 2023: Start StrainPhlAn 3.1.0 execution
Mon Mar 20 10:23:57 2023: Creating temporary directory...
Mon Mar 20 10:23:57 2023: Done.
Mon Mar 20 10:23:57 2023: Getting markers from main sample files...
Mon Mar 20 10:23:57 2023: Done.
Mon Mar 20 10:23:57 2023: Getting markers from main reference files...Warning: [blastn] Examining 5 or more matches is recommended
Warning: [blastn] Examining 5 or more matches is recommended

Mon Mar 20 10:23:58 2023: Done.
Mon Mar 20 10:23:58 2023: Removing bad markers / samples...
[e] Phylogeny can not be inferred. Too many samples were discarded
Mon Mar 20 10:23:58 2023: Stop StrainPhlAn 3.0 execution.

Samples

I have 371 stool samples from kids that have B wexlerae according to metaphlan, with the following abundance / read depth distributions (this is all samples, not just those with B. wexlerae):

Things I’ve tried

  • I tried downloading ~100 reference genomes from ENA. After fixing the identifier lines to make them play nicely with makeblastdb, I tried including them in the -r argument, and the only difference is that I get ~100 lines of Warning: [blastn] Examining 5 or more matches is recommended.
  • I tried restricting the samples I tested to those with >1%, >2%, >5%, and >10% abundance Same result each time.
  • I tried with --marker_in_n_samples 20 --sample_with_n_markers 20, same result.
  • I tried running without reference genomes, sample problem

Earlier commands

❯ sample2markers -i sams/*.sam.bz2 -o consensus_markers/ -n 16
Thu Mar 16 11:56:50 2023: Start samples to markers execution
Thu Mar 16 11:56:50 2023: Decompressing samples...
Thu Mar 16 11:57:17 2023: Done.
Thu Mar 16 11:57:17 2023: Converting samples to BAM format...
Thu Mar 16 11:57:42 2023: Done.
Thu Mar 16 11:57:42 2023: Sorting BAM samples...
Thu Mar 16 11:58:05 2023: Done.
Thu Mar 16 11:58:05 2023: Getting consensus markers from samples...
Thu Mar 16 11:58:05 2023:       Processing sample: consensus_markers/tmpfor3b8wt/FG02304_S75.bam
Thu Mar 16 12:02:04 2023:       Done.
Thu Mar 16 12:02:04 2023:       Processing sample: consensus_markers/tmpfor3b8wt/FG02296_S59.bam
Thu Mar 16 12:07:09 2023:       Done.
#... etc

I don’t have the output from extract_markers, but i ran extract_markers -c s__Blautia_wexlerae -o db_markers/ and it proceeded without errors.

EDIT: Just re-ran extract_markers to see

❯ extract_markers -c s__Blautia_wexlerae -o db_markers/
Mon Mar 20 10:55:34 2023: Start extract markers execution
Mon Mar 20 10:55:34 2023:       Generating DB markers FASTA...
Mon Mar 20 10:56:13 2023:       Done.
Mon Mar 20 10:56:13 2023:       Loading MetaPhlan 3.1.0 databa
se...
Mon Mar 20 10:56:20 2023:       Done.
Mon Mar 20 10:56:20 2023:       Number of markers for the clad
e "s__Blautia_wexlerae": 150
Mon Mar 20 10:56:20 2023:       Exporting markers...
Mon Mar 20 10:56:31 2023:       Done.
Mon Mar 20 10:56:32 2023: Finish extract markers execution (57
.27 seconds): Results are stored at "db_markers/

Final command:

❯ strainphlan -s (cat bwex_gt10_pkl.txt) \ # gets samples only > 10% abundance
    -m db_markers/s__Blautia_wexlerae.fna \
    -r ../genomes/Blautia_wexlerae/GCA_025148125.1_ASM2514812v1_genomic.fna \ # can leave this out, same result
    -o output -n 16 -c s__Blautia_wexlerae --mutation_rate \
    --marker_in_n_samples 20 --sample_with_n_markers 20
Mon Mar 20 10:57:23 2023: Start StrainPhlAn 3.1.0 execution
Mon Mar 20 10:57:23 2023: Creating temporary directory...
Mon Mar 20 10:57:23 2023: Done.
Mon Mar 20 10:57:23 2023: Getting markers from main sample fil
es...
Mon Mar 20 10:57:24 2023: Done.
Mon Mar 20 10:57:24 2023: Getting markers from main reference
files...Warning: [blastn] Examining 5 or more matches is recom
mended

Mon Mar 20 10:57:25 2023: Done.
Mon Mar 20 10:57:25 2023: Removing bad markers / samples...
[e] Phylogeny can not be inferred. Too many samples were discarded
Mon Mar 20 10:57:25 2023: Stop StrainPhlAn 3.0 execution.

Hi @kescobo
I think the best thing to understand what is going on is to run strainphlan with the --debug option and if possible share the full content of the tmp folder created

Hmm - looks like the only thing in the directory is an fna file:

exa --tree output/debugdir/
output/debugdir
├── blastn
└── s__Blautia_wexlerae.fna

So nothing is even getting searched?

I see, so it might be a problem of having too high thresholds for the --sample_with_n_markers and --marker_in_n_samples parameters. You could try to reduce them a bit and try to see the results