Hi,
I am running MetaPhlAn/StrainPhlAn 4.1 on two different datasets, using an approach similar to the one shown in the tutorial. In one dataset (dataset1) the results are perfect, while in the second dataset (dataset2), which has 50 samples, all samples were removed after filtering (step 5 of the StrainPhlAn tutorial).
In step 1 for dataset2 I got the following message: “Warning: The metagenome profile contains clades that represent multiple species merged …”
I first checked the files in the profiles folder and my species is clearly correctly identified.
In the folder db_markers I checked the *.fna file and there are only 17 sequences (far fewer than the number identified in dataset1, which worked properly).
Second, I tried to change the filtering by setting --marker_in_n_samples 1 --sample_with_n_markers 1, but I still have the same problem.
I can understand that some samples with lower coverage or quality are removed during the filtering steps, but I do not understand why all samples were removed if the species has been correctly identified (see the profiles folder).
Do you have any suggestion for understanding the limitation of dataset2?
Many thanks in advance for any support.
Best
Lapo
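The marker count mentioned in the post above can be checked by counting the header lines of the marker FASTA file. This is a minimal sketch in plain Python; the function name `count_fasta_records` is hypothetical and not part of MetaPhlAn or StrainPhlAn.

```python
def count_fasta_records(fasta_path):
    """Count sequences in an uncompressed FASTA file.

    Each record starts with a header line beginning with '>',
    so the number of such lines equals the number of sequences.
    """
    with open(fasta_path) as handle:
        return sum(1 for line in handle if line.startswith(">"))
```

Called on the .fna file in db_markers (for example a path such as db_markers/t__SGB5537.fna, which is an assumed file name), it should return the number of markers available for that SGB.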
Hi Lapo,
it seems that your species (SGB) of choice for dataset2 has only 17 markers in the database. In StrainPhlAn 4.1 we require 20 markers by default, so it’s impossible for any sample to pass this filter. I would suggest setting --sample_with_n_markers, which controls this, to a lower number such as 10, or even 8 or 5. Just be aware that a phylogenetic tree built on so few markers might not be very reliable.
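The arithmetic behind this can be sketched as follows. This is a toy model of a minimum-marker threshold, not StrainPhlAn's actual code; the function `samples_passing` and the per-sample counts are made up for illustration. Only the 17 available markers and the default of 20 come from the thread.

```python
def samples_passing(markers_per_sample, min_markers):
    """Return the samples that carry at least `min_markers` markers."""
    return [name for name, n in markers_per_sample.items() if n >= min_markers]


TOTAL_MARKERS_IN_DB = 17  # markers available for the SGB, as reported in the thread

# Hypothetical per-sample marker counts; none can exceed the 17 in the database.
markers_per_sample = {"sample_A": 17, "sample_B": 12, "sample_C": 6}
assert all(n <= TOTAL_MARKERS_IN_DB for n in markers_per_sample.values())

print(samples_passing(markers_per_sample, 20))  # [] : a threshold of 20 can never be met
print(samples_passing(markers_per_sample, 10))  # ['sample_A', 'sample_B']
```

Because no sample can hold more markers than the database contains, any threshold above 17 excludes every sample regardless of coverage.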
Best
Michal
Dear Michal,
many thanks for your answer. The species with only 17 markers is Wolbachia (t__SGB5537), and I am surprised that there are fewer than 20 markers.
I tried to access the SGB database to retrieve more information but I could not. Is there any protocol a user can follow to add more markers or genomes for their own study species?
Best
Lapo
It’s unfortunately a limitation of our database. We use genes as markers if they are core (conserved within the SGB) and specific (not found in other SGBs). For Wolbachia pipientis we have several SGBs in Jun23 (SGB5537, SGB5535, SGB5536) that might be considered different subspecies of W. pipientis (probably adapted to different hosts) but are different enough that our database clusters them as separate SGBs. It seems they have somewhat similar pangenomes, so there are not many specific markers, only 17.
I can say that we are working on a new database in which it seems we will have more markers (>100), but I estimate it will still take several months to publish. In the meantime, you can try to build the phylogeny with the lower number of markers using the parameter I mentioned previously. There are ways to work with the markers of the whole genus, of which there might be several hundred, and thereby get more sensitive detection and a more robust phylogeny, but this is more complex to set up and not something that’s available to users. Whether it’s worth the effort depends on your application.
Hi Michal,
Thank you for your suggestion. I have tested your settings (--marker_in_n_samples 8) but encountered the same error message. I look forward to comparing my results with the new database.
Best regards,
Lapo
Hi Lapo, could you please specify the exact error message? Is it during the execution of StrainPhlAn?
The warning shouldn’t be relevant to StrainPhlAn.
Michal
Hi Michal,
with the normal settings that worked for the other species:
Thu Feb 20 13:45:42 2025: [Error] Phylogeny can not be inferred. Less than 4 samples remained after filtering.
0 / 8 samples (0 primary) and 17 / 17 markers remained.Thu Feb 20 13:45:42 2025: Stop execution.
And using the settings you suggested:
Tue Feb 25 20:13:07 2025: [Error] Phylogeny can not be inferred. Less than 4 samples remained after filtering.
0 / 8 samples (8 primary) and 14 / 17 markers remained.Tue Feb 25 20:13:07 2025: Stop execution.
Best
Lapo
Now I have noticed that you may have used --marker_in_n_samples instead of --sample_with_n_markers. Could you try lowering the second one to 8 instead of the first one? The first one filters the markers, and you can see that 14/17 were used as a result, while the second one controls the filtering of the samples, so hopefully at least some will be included.
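The difference between the two parameters can be illustrated with a toy model. The sketch below assumes a simple two-step filter (markers first, then samples) on made-up presence data. The order of the steps, the function name `filter_markers_then_samples` and the data are assumptions for illustration and do not reproduce StrainPhlAn's implementation.

```python
def filter_markers_then_samples(presence, marker_in_n_samples, sample_with_n_markers):
    """Toy two-step filter.

    presence: dict mapping sample name -> set of marker names found in it.
    Step 1 keeps markers present in at least `marker_in_n_samples` samples.
    Step 2 keeps samples holding at least `sample_with_n_markers` kept markers.
    """
    all_markers = set().union(*presence.values()) if presence else set()
    kept_markers = {
        m for m in all_markers
        if sum(m in found for found in presence.values()) >= marker_in_n_samples
    }
    kept_samples = [
        s for s, found in presence.items()
        if len(found & kept_markers) >= sample_with_n_markers
    ]
    return kept_markers, kept_samples


# Hypothetical presence data for three samples.
presence = {"s1": {"m1", "m2", "m3"}, "s2": {"m1", "m2"}, "s3": {"m1"}}

markers, samples = filter_markers_then_samples(presence, 2, 2)
print(sorted(markers), samples)  # ['m1', 'm2'] ['s1', 's2']

# Raising only the sample threshold removes samples but leaves the markers unchanged.
markers, samples = filter_markers_then_samples(presence, 2, 3)
print(sorted(markers), samples)  # ['m1', 'm2'] []
```

In this model, changing the marker threshold alters how many markers remain, while only the sample threshold decides whether a sample is kept, which matches the distinction described in the reply above.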
Michal
Hi Michal,
unfortunately, it’s always the same message. Could you please provide the NCBI accession numbers for SGB5537, SGB5535, and SGB5536? I am unable to retrieve them from the SGB database as I cannot open the link (http://opendata.lifebit.ai/table/?project=sgb).
This might explain why there are so few markers or samples with markers.
Best
Lapo