Deeper issue behind "Phylogeny can not be inferred. Too many samples were discarded"

Hello, I tried changing the values for the flags “--marker_in_n_samples” and “--sample_with_n_markers”. Neither change fixed the error.

Many of my metagenome samples have no identified microbial species. I removed them and included only the ones with identified species and abundance values in the profile.tsv files. The error message still pops up.

I then selected a subset of the metagenome samples in which the species I focus on has 100% abundance (to make sure the markers should be identified) and reran StrainPhlAn. The same error message popped up.

Lastly, I changed the species specified in StrainPhlAn to a more common microbe and reran it. It still did not work.

Looking back at my error log, a bowtie2 issue occurred during the DB markers FASTA stage, yet the error did not stop the bash workflow. Could this be the cause? Could anyone help me address it? I created a conda environment for the metaphlan package.

Thu Feb 17 15:16:39 2022: Start samples to markers execution

Thu Feb 17 15:16:39 2022: Decompressing samples…

Thu Feb 17 15:16:41 2022: Done.

Thu Feb 17 15:16:41 2022: Converting samples to BAM format…

Thu Feb 17 15:16:47 2022: Done.

Thu Feb 17 15:16:47 2022: Sorting BAM samples…

Thu Feb 17 15:16:54 2022: Done.

Thu Feb 17 15:16:54 2022: Getting consensus markers from samples…

Thu Feb 17 15:16:54 2022: Processing sample: consensus_markers/tmp/XXX.end1.fq.bam

Thu Feb 17 15:17:03 2022: Done.

Thu Feb 17 15:17:03 2022: Processing sample: consensus_markers/XXX.end1.fq.bam

Thu Feb 17 15:17:16 2022: Done.

Thu Feb 17 15:17:16 2022: Processing sample: consensus_markers/XXX.end2.fq.bam

Thu Feb 17 15:17:26 2022: Done.

Thu Feb 17 15:20:46 2022: Finish samples to markers execution (247.21 seconds): Results are stored at “/XX/Strainphlan_analyses/consensus_markers/”

Thu Feb 17 15:20:46 2022: Start extract markers execution

Thu Feb 17 15:20:46 2022: Generating DB markers FASTA…/miniconda3/envs/XXX/bin/bowtie2-inspect:24: DeprecationWarning: the imp module is deprecated in favour of importlib; see the module’s documentation for alternative uses

import imp

Thu Feb 17 15:21:49 2022: Done.

Thu Feb 17 15:21:49 2022: Loading MetaPhlan 3.0 database…

Thu Feb 17 15:21:58 2022: Done.

Thu Feb 17 15:21:58 2022: Number of markers for the clade “s__XXX”: 133

Thu Feb 17 15:21:58 2022: Exporting markers…

Thu Feb 17 15:22:14 2022: Done.

Thu Feb 17 15:22:14 2022: Finish extract markers execution (87.74 seconds): Results are stored at “/XX/Strainphlan_analyses/db_markers/”

Thu Feb 17 15:22:15 2022: Start StrainPhlAn 3.0 execution

Thu Feb 17 15:22:15 2022: Creating temporary directory…

Thu Feb 17 15:22:15 2022: Done.

Thu Feb 17 15:22:15 2022: Getting markers from main sample files…

Thu Feb 17 15:22:15 2022: Done.

Thu Feb 17 15:22:15 2022: Getting markers from main reference files…

Thu Feb 17 15:22:15 2022: Done.

Thu Feb 17 15:22:15 2022: Removing bad markers / samples…

[e] Phylogeny can not be inferred. Too many samples were discarded

Thu Feb 17 15:22:15 2022: Stop StrainPhlAn 3.0 execution.

Hi @Dahn-young_Dong, thanks for getting in touch.
May I ask how many samples you are using for the analysis? For StrainPhlAn to run on a specific species, at least 4 samples have to be retained at the end of the markers / samples filtering. In your case, it seems that, for the species you are trying to analyse, you do not have 4 samples with at least --sample_with_n_markers percent of the markers reconstructed by sample2markers.py.
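
To make the filtering rule concrete, here is a minimal Python sketch. It is not StrainPhlAn's actual code: the function name, the per-sample marker counts, and the 80% threshold are hypothetical and only illustrate the idea. Only the minimum of 4 retained samples and the 133 clade markers are taken from this thread.

```python
MIN_SAMPLES = 4  # minimum number of retained samples, as stated in this thread


def samples_passing_filter(markers_per_sample, total_markers, sample_with_n_markers_pct):
    """Return the samples that reconstructed at least the given percentage of clade markers."""
    kept = []
    for sample, n_markers in markers_per_sample.items():
        pct = 100.0 * n_markers / total_markers
        if pct >= sample_with_n_markers_pct:
            kept.append(sample)
    return kept


# Hypothetical counts of reconstructed markers per sample (133 clade markers, as in the log).
markers_per_sample = {"s1": 120, "s2": 15, "s3": 0, "s4": 98, "s5": 40}

# Hypothetical threshold of 80%; check the tool's help text for the real default.
kept = samples_passing_filter(markers_per_sample, total_markers=133, sample_with_n_markers_pct=80)
can_infer = len(kept) >= MIN_SAMPLES
print(kept, can_infer)
```

With these made-up numbers only one sample survives the filter, so fewer than 4 samples remain and the phylogeny step would stop. Lowering the threshold helps only if enough samples carry a reasonable share of the markers to begin with.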

About the bowtie2 issue, it seems to be just a warning about a deprecated module being imported by a dependency, and it should not affect your results.

Best,
Aitor

Hi Aitor,

The ~90 metagenome samples are the bioinformatic products left after computationally removing the host organism’s genome sequences. I tried to include all ~90 metagenome samples in fq format for the strain-level analysis. It gave the error message.

When I examined them later, I found that only ~10 of the ~90 metagenomes have the specific species identified by MetaPhlAn profiling. I selected the .pkl outputs for these ~10 metagenomes and reran StrainPhlAn. It gave the same error message.

Could you share some resources on the biology and bioinformatics behind sample2markers.py? I don’t understand why the ~10 metagenome samples in which MetaPhlAn already identified the specific species failed to yield enough markers from sample2markers.py. I could go back to my abundance profiling table to check whether I misread it.

Best,

Hi @Dahn-young_Dong ,
Both the MetaPhlAn and StrainPhlAn methods are described in the bioBakery 3 paper: “Integrating taxonomic, functional, and strain-level profiling of diverse microbial communities with bioBakery 3” (eLife).
Are the samples you are analysing low in microbial biomass (e.g. tissue samples)? In that case, it is possible that you do not have enough coverage of your markers to run StrainPhlAn but still enough to get results from MetaPhlAn, since the latter is more sensitive. This is because MetaPhlAn only needs 20% of the available markers of a species to have at least 1 read mapped each in order to detect the species, while StrainPhlAn (the sample2markers.py script) needs the markers to be covered by mapped reads over at least 80% of their length. This is difficult to assess from the default MetaPhlAn profiles, as they are reported in terms of relative abundance. You could try running metaphlan with the option -t rel_ab_w_read_stats to get a better picture of the species coverage in your reads.
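
The gap between the two criteria can be sketched in a few lines of Python. This is a simplified illustration of the thresholds quoted above (20% of markers with at least one read for detection, 80% of marker length covered for strain-level use), not code from either tool. The function names and the per-marker numbers are invented for the example.

```python
def detected_by_profiling(reads_per_marker, min_marker_fraction=0.20):
    """Detection rule as described above: enough markers have at least 1 mapped read."""
    markers_hit = sum(1 for n_reads in reads_per_marker if n_reads >= 1)
    return markers_hit / len(reads_per_marker) >= min_marker_fraction


def markers_usable_for_strains(breadth_per_marker, min_breadth=0.80):
    """Strain-level rule as described above: marker covered over >= 80% of its length."""
    return [i for i, breadth in enumerate(breadth_per_marker) if breadth >= min_breadth]


# Hypothetical low-biomass sample with 10 markers for the species of interest.
reads_per_marker = [1, 2, 1, 0, 0, 0, 0, 0, 0, 0]
breadth_per_marker = [0.30, 0.55, 0.25, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]

species_detected = detected_by_profiling(reads_per_marker)
usable_markers = markers_usable_for_strains(breadth_per_marker)
print(species_detected, usable_markers)
```

In this invented sample the species counts as detected (3 of 10 markers have reads), yet no marker is covered well enough for strain-level analysis, which is the situation described above for low-biomass samples.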

Hello, thank you for the help.
As you suggested, I ran metaphlan with the option -t rel_ab_w_read_stats to see the coverage of the reads. This is what I found:

#clade_name	clade_taxid	relative_abundance	coverage	estimated_number_of_reads_from_the_clade
UNKNOWN	-1	0.0	-	555034
s__XXXX	12345	62.89044	-	-
s__XXXX	12345	19.32389	-	-

Under the coverage and estimated-number-of-reads columns, the species rows show only a dash (“-”). What does that mean? Is it conclusive in any way?
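
For anyone checking many profiles at once, here is a small Python sketch that lists which rows have a dash in the read-statistics columns. It assumes the tab-separated layout and header shown above and nothing more. Real profile files may have extra leading comment lines that would need to be skipped, and the function name is just for illustration. It only reports where values are missing and does not explain why.

```python
import csv
import io

# The rows quoted above, tab-separated (species names are placeholders from the post).
profile_text = (
    "#clade_name\tclade_taxid\trelative_abundance\tcoverage\t"
    "estimated_number_of_reads_from_the_clade\n"
    "UNKNOWN\t-1\t0.0\t-\t555034\n"
    "s__XXXX\t12345\t62.89044\t-\t-\n"
    "s__XXXX\t12345\t19.32389\t-\t-\n"
)

STAT_COLUMNS = ["coverage", "estimated_number_of_reads_from_the_clade"]


def rows_missing_read_stats(text):
    """Return (clade_name, missing_columns) for every row with a '-' in a stats column."""
    reader = csv.reader(io.StringIO(text), delimiter="\t")
    header = next(reader)
    header[0] = header[0].lstrip("#")  # the header line starts with '#'
    missing = []
    for row in reader:
        record = dict(zip(header, row))
        empty = [col for col in STAT_COLUMNS if record[col] == "-"]
        if empty:
            missing.append((record["clade_name"], empty))
    return missing


for clade, columns in rows_missing_read_stats(profile_text):
    print(clade, "is missing:", ", ".join(columns))
```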

Best wishes,

Hi @Dahn-young_Dong, are you running it with the --unknown_estimation parameter?