Inquiry regarding change of default parameters in StrainPhlAn

Hello,

I would like to express my sincere gratitude for your continuous efforts in developing and maintaining bioBakery tools.

I am currently analyzing metagenome data with StrainPhlAn (v4.0.6 with the vOct22_CHOCOPhlAnSGB_202212 database), and I have encountered an issue where the output tree contains only a few samples.

In an attempt to increase the number of samples, I have been experimenting with parameter adjustments.

Following the guidance provided in previous posts, I first tested adjusting the filtering criteria using --marker_in_n_samples and --sample_with_n_markers within the range of 10-50. Although this increased the number of samples in the output tree, the increase was not substantial.

I then used MetaPhlAn4 profiling to compare the abundances of the samples that were filtered out with those of the samples that were retained, and found that many samples are being filtered out despite having abundance levels similar to those of the retained samples.

Therefore, I am considering modifying parameters at the sample marker extraction stage (using the sample2markers.py script).

Specifically, I am looking into adjusting the mapping quality (--min_mapping_quality flag) and breadth of coverage (--breadth_threshold flag). However, I am uncertain about the acceptable levels of these adjustments and seek your guidance.

  1. The --breadth_threshold seems to be particularly important for identifying polymorphism in the marker genes. Is it feasible or permissible to lower this value to around 50? (Based on your experience, what would be an acceptable lower limit?)

  2. For --min_mapping_quality, MetaPhlAn4 uses a value of 5, while StrainPhlAn uses 10. Would it be acceptable to lower the mapping quality in StrainPhlAn, for example to 5, as well?
    (I understand that MetaPhlAn4 is intended for profiling purposes, whereas StrainPhlAn may require more stringent criteria to detect small differences within marker genes.)

I have searched for related discussions but have not found any (I apologize if I missed something). I kindly ask for your understanding if I have misunderstood or overlooked any aspects.

Therefore, I am reaching out for advice.

Thank you again for your invaluable contributions to our field.

Sincerely,

Hello @jylee,
those are good questions.

I don’t expect min_mapping_quality to play a big role in how many samples you’ll get in the trees. You can lower it down to 1, but keep an eye out for outlier branches in the trees: those could indicate “wrongly” mapped reads creating false SNPs.
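
To put those thresholds in perspective: under the SAM convention, MAPQ is Phred-scaled, i.e. MAPQ = -10·log10(P(mapping position is wrong)). The small sketch below only converts a threshold into that nominal probability. Keep in mind that aligners such as Bowtie2 assign MAPQ heuristically, so these are nominal values rather than calibrated error rates, and the function name is just for illustration.

```python
def mapq_to_error_prob(mapq: float) -> float:
    """Nominal probability that a read is mapped to the wrong place (SAM convention)."""
    return 10 ** (-mapq / 10)

# Thresholds discussed in this thread: 10 (StrainPhlAn), 5 (MetaPhlAn4), 1 (lowest suggested)
for q in (10, 5, 1):
    p = mapq_to_error_prob(q)
    print(f"MAPQ >= {q}: reads with nominal error probability up to {p:.2f} are kept")
```

So a cutoff of 10 nominally admits reads with up to about 0.10 error probability, 5 up to about 0.32, and 1 up to about 0.79, which is why outlier branches are worth checking after lowering it.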

You can try lowering the breadth threshold, but I wouldn’t go below 50%. Lowering it will include some more low-coverage samples (coverage is related to both relative abundance and sequencing depth). Keep in mind, though, that their placement in the tree will be less reliable; whether you are willing to accept the possible errors depends on your goal. For example, you could get a flat branch where some samples look like the same strain, but only because there are too many gaps in their alignment. In general, it’s difficult to say which value is still acceptable, as it depends a lot on the species and on the marker genes for that species.
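
To illustrate the flat-branch effect, here is a toy sketch. It is not StrainPhlAn's actual code: the sequences, the function names and the use of "-" for uncovered positions are all assumptions made for the example. It computes breadth of coverage as the percentage of non-gap positions and counts differences only over positions covered in both sequences.

```python
def breadth(seq: str, gap: str = "-") -> float:
    """Percentage of positions in a marker sequence that are covered (non-gap)."""
    return 100 * sum(c != gap for c in seq) / len(seq)

def snp_distance(a: str, b: str, gap: str = "-"):
    """Count mismatches over positions covered in both sequences.

    Returns (mismatches, number_of_compared_positions).
    """
    shared = [(x, y) for x, y in zip(a, b) if x != gap and y != gap]
    return sum(x != y for x, y in shared), len(shared)

# Hypothetical toy marker: strain_b differs from strain_a at two positions
strain_a     = "ACGTACGTAC"
strain_b     = "ACTTACGAAC"
# The same strain_b from a low-coverage sample: both variant positions are uncovered
strain_b_low = "AC-TA-G-A-"

print(breadth(strain_b), snp_distance(strain_a, strain_b))          # full coverage
print(breadth(strain_b_low), snp_distance(strain_a, strain_b_low))  # 60% breadth
```

In this toy case the low-coverage copy of strain_b has 60% breadth, so it would pass a 50% threshold, yet it shows zero differences from strain_a because both variant positions fall in gaps. In a tree, the two would look like the same strain.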