Options to adjust thresholds to work around 'too many samples discarded' issue


I am using StrainPhlAn in MetaPhlAn 4.0.6. I have been trying to generate strain phylogenies for taxa that appear in >50% of my samples according to my MetaPhlAn outputs. Even with this prevalence cutoff, I am getting a lot of StrainPhlAn errors that say “too many samples discarded”.

I noticed that there are also the following options:

[--marker_in_n_samples MARKER_IN_N_SAMPLES]
[--sample_with_n_markers SAMPLE_WITH_N_MARKERS]

Both of these thresholds default to 80%. Are there suggested minimum values, or would setting them to 1% be appropriate? Or should I set --marker_in_n_samples to 1% and keep --sample_with_n_markers at 50% to minimize low-quality (uninformative) alignments?


Hi @osvatic
Decreasing the threshold to 1% will probably lead to low-quality alignments with large gappy regions. I would not go below 20% in most cases.

Hi @aitor.blancomiguez,

Thanks for the response!

Are you referring to 1% (or 20%) for --marker_in_n_samples, or for --sample_with_n_markers?

I could see how a low % in ‘sample_with_n_markers’ would affect the alignments. Currently I am trying 50% for this value and have reduced ‘marker_in_n_samples’ to 10%, which was used in a few papers.

For either of them. Reducing --sample_with_n_markers too much will create rows in the MSA with many gaps, while reducing --marker_in_n_samples will include gappy columns that will later be discarded during the trimming procedure.
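
To make the rows-versus-columns point concrete, here is a minimal toy sketch in Python. It is not StrainPhlAn's actual implementation: the function names, the sample/marker data, the order of the two filtering steps, and the treatment of both thresholds as percentages are all assumptions made for illustration. It only shows how loosening each threshold changes which samples (MSA rows) and markers (MSA columns) survive, and how gappy the resulting matrix is.

```python
from typing import Dict, List, Set, Tuple


def filter_markers_and_samples(
    presence: Dict[str, Set[str]],
    markers: List[str],
    marker_in_n_samples: float,
    sample_with_n_markers: float,
) -> Tuple[List[str], List[str]]:
    """Toy filter; both thresholds are treated as percentages (assumption)."""
    n_samples = len(presence)
    # Keep markers (MSA columns) found in at least the given % of samples.
    kept_markers = [
        m for m in markers
        if 100.0 * sum(m in found for found in presence.values()) / n_samples
        >= marker_in_n_samples
    ]
    # Keep samples (MSA rows) carrying at least the given % of the kept markers.
    kept_samples = [
        s for s, found in presence.items()
        if kept_markers
        and 100.0 * sum(m in found for m in kept_markers) / len(kept_markers)
        >= sample_with_n_markers
    ]
    return kept_markers, kept_samples


def gap_fraction(
    presence: Dict[str, Set[str]],
    kept_markers: List[str],
    kept_samples: List[str],
) -> float:
    """Fraction of (sample, marker) cells that would be all-gap in the alignment."""
    cells = len(kept_markers) * len(kept_samples)
    if cells == 0:
        return 0.0
    missing = sum(
        m not in presence[s] for s in kept_samples for m in kept_markers
    )
    return missing / cells


# Hypothetical data: which markers were reconstructed in which sample.
presence = {
    "s1": {"m1", "m2", "m3", "m4"},
    "s2": {"m1", "m2", "m3"},
    "s3": {"m1", "m2"},
    "s4": {"m1"},
    "s5": {"m4"},
}
markers = ["m1", "m2", "m3", "m4"]

# (marker_in_n_samples, sample_with_n_markers) settings discussed in the thread.
for marker_thr, sample_thr in [(80, 80), (10, 50), (1, 1)]:
    kept_m, kept_s = filter_markers_and_samples(
        presence, markers, marker_thr, sample_thr
    )
    gaps = gap_fraction(presence, kept_m, kept_s)
    print(marker_thr, sample_thr, kept_m, kept_s, round(gaps, 2))
```

On this toy data, the strict 80/80 setting keeps a single marker with no gaps, 10/50 keeps all four markers but only three samples, and 1/1 keeps everything at the cost of a matrix that is almost half gaps. Real alignments will behave differently, so treat this only as an intuition aid for choosing values between the defaults and the roughly 20% floor suggested above.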