Filtering settings StrainPhlAn 4.1

Lapo_Ragionieri · February 17, 2025, 6:09pm

Hi,
I am running MetaPhlAn/StrainPhlAn 4.1 on two different datasets, using an approach similar to the one shown in the tutorial. In one dataset (dataset1) the results are perfect, while in the second (dataset2), which has 50 samples, all samples were removed after filtering (step 5 of the StrainPhlAn tutorial).

In step 1 for dataset2 I got the following message: “Warning: The metagenome profile contains clades that represent multiple species merged …”

I first checked the files in the profiles folder and my species is clearly correctly identified.

In the db_markers folder I checked the *.fna file, and it contains only 17 sequences (far fewer than for the species in dataset1, which worked properly).
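A quick way to confirm the number of marker sequences in such a file is to count its FASTA header lines. This is a minimal sketch; the function name is hypothetical, and it only assumes the *.fna file is plain-text FASTA in which each record starts with `>`.

```python
def count_fasta_records(fasta_path):
    """Count sequences in a FASTA file by counting lines that start with '>'."""
    with open(fasta_path, "r", encoding="utf-8") as handle:
        return sum(1 for line in handle if line.startswith(">"))
```

Calling `count_fasta_records` on the file in db_markers should return the number of markers extracted for the clade.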

Second, I tried relaxing the filtering by setting --marker_in_n_samples 1 --sample_with_n_markers 1, but I still have the same problem.

I can understand that the filtering steps removed some samples with poorer coverage or quality, but I do not understand why all samples were removed when the species was correctly identified (see the profiles folder).

Do you have any suggestions for understanding what is limiting dataset2?
Many thanks in advance for any support.

Best
Lapo

Michal_Puncochar · February 20, 2025, 10:42am

Hi Lapo,

It seems that your species (SGB) of choice for dataset2 has only 17 markers in the database. In StrainPhlAn 4.1 we require 20 markers by default, so it’s impossible for any sample to pass this filter. I would suggest lowering --sample_with_n_markers, which controls this, to something like 10, or even 8 or 5. Just be aware that a phylogenetic tree built on so few markers might not be that reliable.
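To see why a threshold of 20 can never be met with 17 markers, and how lower thresholds behave, here is a small sketch of the per-sample marker count check. The function name and the per-sample counts are hypothetical, and it ignores every other filter StrainPhlAn applies.

```python
def samples_passing(markers_per_sample, sample_with_n_markers):
    """Return the names of samples whose marker count meets the threshold."""
    return sorted(
        name
        for name, n_markers in markers_per_sample.items()
        if n_markers >= sample_with_n_markers
    )


# Hypothetical counts for illustration only. With 17 markers in the
# database, no sample can ever have more than 17.
toy_counts = {"sample_a": 17, "sample_b": 12, "sample_c": 9, "sample_d": 4}

for threshold in (20, 10, 8, 5):
    kept = samples_passing(toy_counts, threshold)
    print(threshold, len(kept), kept)
```

In this toy data nothing passes at 20, and lowering the threshold keeps more samples.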

Best

Michal

Lapo_Ragionieri · February 20, 2025, 1:59pm

Dear Michal,
many thanks for your answer. The species with only 17 markers is Wolbachia (t__SGB5537), and I am surprised that there are fewer than 20 markers.

I tried to access the SGB database to retrieve more information, but I could not. Is there a protocol a user can follow to add more markers or genomes for their own study species?

Best
Lapo

Michal_Puncochar · February 21, 2025, 9:58am

Unfortunately, it’s a limitation of our database. We use genes as markers if they are core (conserved within the SGB) and specific (not found in other SGBs). For Wolbachia pipientis we have several SGBs in Jun23 (SGB5537, SGB5535, SGB5536) that might be considered different subspecies of W. pipientis (probably adapted to different hosts) but are different enough that our database clusters them as separate SGBs. They seem to have fairly similar pangenomes, so there are not many specific markers, only 17.
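The "core and specific" idea can be illustrated with plain set logic. This is a toy sketch, not the actual database pipeline: the function name, the `core_fraction` threshold, and the gene sets are all assumptions made for illustration.

```python
def select_markers(genomes_by_sgb, target_sgb, core_fraction=0.9):
    """Return genes that are core in target_sgb and absent from all other SGBs.

    genomes_by_sgb maps an SGB name to a list of genomes, each a set of genes.
    core_fraction is a hypothetical threshold, not the database's real one.
    """
    target_genomes = genomes_by_sgb[target_sgb]
    all_genes = set().union(*target_genomes)
    # Core: present in at least core_fraction of the target SGB's genomes.
    core = {
        gene
        for gene in all_genes
        if sum(gene in genome for genome in target_genomes) / len(target_genomes)
        >= core_fraction
    }
    # Specific: not found in any genome of any other SGB.
    foreign = set()
    for sgb, genomes in genomes_by_sgb.items():
        if sgb != target_sgb:
            foreign.update(*genomes)
    return core - foreign


# Toy data: the sibling SGBs share most core genes with the target,
# so few genes survive as specific markers.
toy = {
    "target": [{"a", "b", "c", "d"}, {"a", "b", "c", "e"}],
    "sibling_1": [{"b", "c", "f"}],
    "sibling_2": [{"c", "g"}],
}
print(sorted(select_markers(toy, "target")))
```

The more the pangenomes of closely related SGBs overlap, the smaller the set returned for each of them.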

I can say that we are working on a new database in which it seems we would have more markers (>100), but I estimate it will still take several months to publish. In the meantime, you can try building the phylogeny with a lower number of markers using the parameter I mentioned previously. There are ways to work with markers of the whole genus, which might number several hundred, and so get more sensitive detection and a more robust phylogeny, but this is more complex to set up and not something that’s available to users. Whether it’s worth the effort depends on your application.

Lapo_Ragionieri · February 25, 2025, 8:13pm

Hi Michal,

Thank you for your suggestion. I have tested your settings (--marker_in_n_samples 8) but encountered the same error message. I look forward to comparing my results with the new database.

Best regards,

Lapo

Michal_Puncochar · February 26, 2025, 11:36am

Hi Lapo, could you please post the exact error message? Does it occur during the execution of StrainPhlAn?

The warning about clades that represent multiple merged species shouldn’t be relevant to StrainPhlAn.

Michal

Lapo_Ragionieri · February 26, 2025, 12:58pm

Hi Michal,

with the normal settings that worked for the other species:

Thu Feb 20 13:45:42 2025: [Error] Phylogeny can not be inferred. Less than 4 samples remained after filtering.
0 / 8 samples (0 primary) and 17 / 17 markers remained.
Thu Feb 20 13:45:42 2025: Stop execution.

And with the settings you suggested:

Tue Feb 25 20:13:07 2025: [Error] Phylogeny can not be inferred. Less than 4 samples remained after filtering.
0 / 8 samples (8 primary) and 14 / 17 markers remained.
Tue Feb 25 20:13:07 2025: Stop execution.
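When comparing several runs, it can help to pull the counts out of these summary lines programmatically. The sketch below is based only on the two messages quoted in this thread; the pattern and function name are assumptions, and other StrainPhlAn versions may word the message differently.

```python
import re

# Pattern written against the message format quoted in this thread only.
SUMMARY_RE = re.compile(
    r"(?P<samples_kept>\d+)\s*/\s*(?P<samples_total>\d+) samples "
    r"\((?P<primary>\d+) primary\) and "
    r"(?P<markers_kept>\d+)\s*/\s*(?P<markers_total>\d+) markers remained"
)


def parse_filter_summary(log_text):
    """Return the filtering counts as a dict of ints, or None if not found."""
    match = SUMMARY_RE.search(log_text)
    if match is None:
        return None
    return {key: int(value) for key, value in match.groupdict().items()}


line = "0 / 8 samples (8 primary) and 14 / 17 markers remained."
print(parse_filter_summary(line))
```

Separating the sample counts from the marker counts makes it easier to see which of the two filters responded to a parameter change.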

Best
Lapo

Michal_Puncochar · February 26, 2025, 1:13pm

I have now noticed that you may have used --marker_in_n_samples instead of --sample_with_n_markers. Could you try lowering the second one to 8 instead of the first? The first one filters the markers, and you can see that 14/17 were kept as a result, while the second one controls the filtering of the samples, so hopefully at least some will be included.
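The difference between the two filters can be sketched on a toy presence table. This is a simplified model, not StrainPhlAn's code: the function name is hypothetical, both thresholds are treated as absolute counts, and filtering markers before samples is an assumption.

```python
def filter_markers_then_samples(presence, marker_in_n_samples, sample_with_n_markers):
    """Apply a marker filter, then a sample filter, to a presence table.

    presence maps a sample name to the set of markers found in it.
    """
    # Count in how many samples each marker was found.
    marker_counts = {}
    for markers in presence.values():
        for marker in markers:
            marker_counts[marker] = marker_counts.get(marker, 0) + 1
    # Marker filter: keep markers seen in enough samples.
    kept_markers = {m for m, n in marker_counts.items() if n >= marker_in_n_samples}
    # Sample filter: keep samples that retain enough of the kept markers.
    kept_samples = {
        sample: markers & kept_markers
        for sample, markers in presence.items()
        if len(markers & kept_markers) >= sample_with_n_markers
    }
    return kept_markers, kept_samples


toy = {"s1": {"m1", "m2", "m3"}, "s2": {"m1", "m2"}, "s3": {"m1"}}

# Changing only the marker threshold changes which markers survive ...
print(filter_markers_then_samples(toy, 1, 3))
# ... while the sample threshold decides which samples survive.
print(filter_markers_then_samples(toy, 2, 2))
```

In this model, lowering the marker threshold alone cannot rescue a sample that has fewer markers than the sample threshold requires.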

Michal

Lapo_Ragionieri · February 26, 2025, 2:15pm

Hi Michal,
unfortunately, I still get the same message. Could you please provide the NCBI accession numbers for SGB5537, SGB5535, and SGB5536? I am unable to retrieve them from the SGB database as I cannot open the link (http://opendata.lifebit.ai/table/?project=sgb).

This might explain why there are so few markers or samples with markers.

Best
Lapo
