Too many samples discarded

Sorry, I realize this has been asked a few times for strainphlan, but I have the same problem and cannot work out which step is failing. I am using the most lenient settings for the sample and marker thresholds, but it still throws this error message. I chose my clade markers based on MetaPhlAn results, so I expect these markers to be present in at least a quarter of my samples.

When I run it with --print_clades_only, it doesn’t return any clade either.

Command:

strainphlan -s ${WORKDIR}/consensus_markers/*.pkl \
-m ${WORKDIR}/db_markers/${sgb}.fna \
-o ${WORKDIR}/output \
-n 8 \
-d ${WORKDIR}/metaphlan-db \
--tmp ${WORKDIR}/temporary \
--debug --marker_in_n_samples 1 \
--sample_with_n_markers 1 \
-c ${sgb} \
--abs_n_markers_thres \
--abs_n_samples_thres \
--breadth_thres 80 \
--mutation_rates

Log:

Wed Mar 29 21:44:51 2023: Start StrainPhlAn 4.0.6 execution
Wed Mar 29 21:44:51 2023: Creating temporary directory...
Wed Mar 29 21:44:51 2023: Done.
Wed Mar 29 21:44:51 2023: Filtering markers and samples...
Wed Mar 29 21:44:51 2023: Getting markers from main samples...
Wed Mar 29 21:44:51 2023: Done.
Wed Mar 29 21:44:51 2023: Getting markers from main references...
Wed Mar 29 21:44:51 2023: Done.
Wed Mar 29 21:44:51 2023: Removing bad markers / samples...

Error message:

Wed Mar 29 21:44:51 2023: [Error] Phylogeny can not be inferred. Too many samples were discarded.
Wed Mar 29 21:44:51 2023: Stop StrainPhlAn execution.

Tmp output:

 |-tmp1ijb_u1t
 | |-t__SGBxxxx.fna
 | |-blastn

Is there a way to know at least whether this is a problem with the quality of my sequences (so more upstream) or something fixable in the parameters (so downstream)? The .pkl files in my generated consensus_markers range in size from 5.5 MB to 28.1 MB.

Thank you!

Hi @ange
Were the clade markers manually generated by you or with the extract_markers.py script?

Hi,

The clade markers were generated with extract_markers.py

Did you run MetaPhlAn with the database version Jan21 or Oct22? In version 4.0.6, Oct22 is the default database, so if you ran MetaPhlAn with the previous database version, it will lead to this kind of result.
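
One way to check which database a set of existing profiles came from is to look at the header of the MetaPhlAn output. The sketch below assumes the profile starts with comment lines and that one of them begins with `#mpa_v` followed by the database name (for example `#mpa_vJan21_CHOCOPhlAnSGB_202103`); the function name `database_version` is made up for illustration and is not part of MetaPhlAn.

```python
def database_version(profile_lines):
    """Return the database name from a MetaPhlAn profile header, or None.

    Assumption: the header contains a comment line starting with '#mpa_v'.
    """
    for line in profile_lines:
        if not line.startswith("#"):
            break  # header comments end at the first data row
        if line.startswith("#mpa_v"):
            return line.strip().lstrip("#")
    return None


# Hypothetical usage on a profile file:
# with open("sample_profile.txt") as handle:
#     print(database_version(handle))
```

If the name reported here differs from the database passed to strainphlan with -d, the marker names will likely not match, which would be consistent with every sample being discarded.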

Okay, thank you. It seems the metaphlan results we have were generated from vJan21. The analysis I’m trying out is based on these metaphlan results so I would rather adjust to accommodate the vJan21 data. Is there a way to download chocophlan vJan21 instead or an earlier version of strainphlan that supports vJan21?

Whoops, I saw that it’s written in the documentation. I’ll download vJan21 and see. Thank you again!!

Sorry, it’s throwing a different error now, about an unsupported pickle protocol?

Command:

strainphlan -s ${WORKDIR}/consensus_markers/*.pkl \
-r /dir/Shotgun/refseq-Acaccae_ncbi-genomes-2023-03-28/GCF_020181435.1_ASM2018143v1_genomic.fna \
-o ${WORKDIR}/output -n 8 \
-d /dir/scratch/metaphlan-db/mpa_vJan21_CHOCOPhlAnSGB_202103.pkl \
--tmp /dir/scratch/temporary/tmp \
--debug \
--marker_in_n_samples 10 \
--sample_with_n_markers 1 -c "t__SGB4529" \
--abs_n_markers_thres \
--abs_n_samples_thres \
--breadth_thres 80 \
--mutation_rates \
--print_clades_only

Log:

Mon Apr  3 12:08:29 2023: Start StrainPhlAn 4.0.6 execution
Mon Apr  3 12:08:29 2023: Loading MetaPhlAn mpa_vJan21_CHOCOPhlAnSGB_202103 database...
Mon Apr  3 12:08:50 2023: Done.
Mon Apr  3 12:08:53 2023: Detecting clades...

Error:

Traceback (most recent call last):
  File "/dir/anaconda3/envs/mph4/bin/strainphlan", line 8, in <module>
    sys.exit(main())
  File "/dir/anaconda3/envs/mph4/lib/python3.7/site-packages/metaphlan/strainphlan.py", line 624, in main
    strainphlan_runner.run_strainphlan()
  File "/dir/anaconda3/envs/mph4/lib/python3.7/site-packages/metaphlan/strainphlan.py", line 452, in run_strainphlan
    self.print_clades()
  File "/dir/anaconda3/envs/mph4/lib/python3.7/site-packages/metaphlan/strainphlan.py", line 370, in print_clades
    species2samples = self.detect_clades(markers2species)
  File "/dir/anaconda3/envs/mph4/lib/python3.7/site-packages/metaphlan/strainphlan.py", line 353, in detect_clades
    sample = ConsensusMarkers(pkl_file=sample_path)
  File "/dir/anaconda3/envs/mph4/lib/python3.7/site-packages/metaphlan/utils/consensus_markers.py", line 100, in __init__
    self.from_pkl(pkl_file)
  File "/dir/anaconda3/envs/mph4/lib/python3.7/site-packages/metaphlan/utils/consensus_markers.py", line 93, in from_pkl
    pkl_file)[1] == ".bz2" else pickle.load(open(pkl_file, "rb"))
ValueError: unsupported pickle protocol: 5

Hi @ange
It looks like sample2markers was run with Python 3.8+, which generated the pkl files with pickle protocol 5, while you are running strainphlan with a Python version lower than 3.8 (in this case 3.7). I would update Python to 3.8 or later in the environment where you are running strainphlan.
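
To confirm the mismatch before rebuilding the environment, the protocol number can be read straight from the file header without unpickling anything. This is a minimal sketch using only the standard library: pickles written with protocol 2 or higher start with the byte 0x80 followed by the protocol number. The helper names `pickle_protocol` and `readable_here` are made up for illustration, and the `.bz2` handling simply mirrors the extension check visible in the traceback.

```python
import bz2
import pickle


def pickle_protocol(path):
    """Return the protocol declared in the pickle header.

    Returns None for protocols 0 and 1, which carry no header.
    """
    opener = bz2.open if path.endswith(".bz2") else open
    with opener(path, "rb") as handle:
        header = handle.read(2)
    if len(header) == 2 and header[0] == 0x80:
        return header[1]
    return None


def readable_here(path):
    """True if the running interpreter supports the file's protocol."""
    protocol = pickle_protocol(path)
    return protocol is None or protocol <= pickle.HIGHEST_PROTOCOL


# Hypothetical usage, run inside the strainphlan environment:
# print(pickle_protocol("consensus_markers/sample1.pkl"), pickle.HIGHEST_PROTOCOL)
```

Under Python 3.7, `pickle.HIGHEST_PROTOCOL` is 4, so a file reporting protocol 5 cannot be loaded there.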