Hello,
I am having issues using StrainPhlAn version 4.0.6 (1 Mar 2023).
While running the tutorial from the StrainPhlAn 4 page of the biobakery/MetaPhlAn wiki on GitHub, I get this error:
[Error] Phylogeny can not be inferred. Too many samples were discarded.
I have read the forum topics on this error, but none of them resolved my issue.
Below are the commands I ran, with my commentary in between.
The database is stored outside of the environment, so I added the --bowtie2db parameter.
for f in fastq/SRS*
do
    echo "Running MetaPhlAn on ${f}"
    bn=$(basename "${f}")
    metaphlan "${f}" --input_type fastq --bowtie2db /Volumes/Promise_Beast/Databases/metaphlan4_database -s "sams/${bn}.sam.bz2" --bowtie2out "bowtie2/${bn}.bowtie2.bz2" -o "profiles/${bn}_profiled.tsv"
done
This seemed to work and gave reasonable-looking results. I then ran:
sample2markers.py -i sams/*.sam.bz2 -d /Volumes/Promise_Beast/Databases/metaphlan4_database -o consensus_markers -n 8
This also seemed to produce reasonable-looking results.
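One quick way to confirm that every SAM file actually produced a consensus-marker file is sketched below. The function name `missing_consensus` is made up for illustration, and the naming convention it assumes (each output named after the input's basename with a `.pkl` suffix) is a guess that may need adjusting for other StrainPhlAn versions or file names.

```shell
#!/usr/bin/env bash
# List samples whose SAM file has no matching consensus-marker file.
# ASSUMPTION: sams/X.sam.bz2 is expected to yield consensus_markers/X.pkl;
# change the suffixes below if your outputs are named differently.
missing_consensus() {
    local sam_dir="$1" marker_dir="$2" f bn
    for f in "$sam_dir"/*.sam.bz2; do
        [ -e "$f" ] || continue            # skip the literal pattern when nothing matches
        bn=$(basename "$f" .sam.bz2)
        [ -e "$marker_dir/$bn.pkl" ] || echo "$bn"
    done
}

# Example (paths from the commands above):
#   missing_consensus sams consensus_markers
```

No output would mean that every SAM file has a counterpart; any name printed is a sample that was dropped before strainphlan was ever run.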
Getting the extract_markers.py command to work took some guessing, because it would not accept the database folder for the -d parameter. Passing the .pkl file instead seemed to work.
extract_markers.py -c t__SGB1877 -d /Volumes/Promise_Beast/Databases/metaphlan4_database/mpa_vOct22_CHOCOPhlAnSGB_202212.pkl -o db_markers/
Finally, I tried strainphlan.
strainphlan -s consensus_markers/*.pkl -m db_markers/t__SGB1877.fna -r G000273725.fna.bz2 -d /Volumes/Promise_Beast/Databases/metaphlan4_database/mpa_vOct22_CHOCOPhlAnSGB_202212.pkl -o output -n 8 -c t__SGB1877 --mutation_rates
Tue Aug 1 09:24:44 2023: Start StrainPhlAn 4.0.6 execution
Tue Aug 1 09:24:44 2023: Creating temporary directory…
Tue Aug 1 09:24:44 2023: Done.
Tue Aug 1 09:24:44 2023: Filtering markers and samples…
Tue Aug 1 09:24:44 2023: Getting markers from main samples…
Tue Aug 1 09:24:45 2023: Done.
Tue Aug 1 09:24:45 2023: Getting markers from main references…
Tue Aug 1 09:24:45 2023: Done.
Tue Aug 1 09:24:45 2023: Removing bad markers / samples…
Tue Aug 1 09:24:45 2023: [Error] Phylogeny can not be inferred. Too many samples were discarded.
Tue Aug 1 09:24:45 2023: Stop StrainPhlAn execution.
This surprised me. I checked the MetaPhlAn results for the samples, and every one of them is 100% the strain being referenced, so they should surely pass the 80% of samples cutoff.
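For anyone who wants to repeat that check, a minimal sketch follows. It assumes the default MetaPhlAn profile layout (tab-separated, comment lines starting with `#`, clade name in column 1, relative abundance in column 3); the function name `clade_abundance` is hypothetical.

```shell
#!/usr/bin/env bash
# Print "<profile file><TAB><relative abundance>" for one clade in each profile.
# ASSUMPTION: default MetaPhlAn output columns; prints NA when the clade is absent.
clade_abundance() {
    local clade="$1"; shift
    local f
    for f in "$@"; do
        awk -F'\t' -v clade="$clade" -v file="$f" '
            /^#/ { next }                      # skip header and comment lines
            {
                n = split($1, parts, "|")      # the last "|" field is the deepest level
                if (parts[n] == clade) { print file "\t" $3; found = 1 }
            }
            END { if (!found) print file "\tNA" }
        ' "$f"
    done
}

# Example (paths from the loop above):
#   clade_abundance t__SGB1877 profiles/*_profiled.tsv
```

Note that a high relative abundance only says the clade dominates the profile. It does not by itself show how many of the clade's markers were reconstructed in each sample.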
From reading the forum, I gathered that the tutorial was at one point set up using the jan21 database. If the tutorial had not been updated since then, differences between the databases could be causing the issue. That seemed an unlikely explanation, though, given that all of the work was done with the new database and that the SAM and marker files appeared to contain SGB1877 markers.
However, to make sure, I downloaded the jan21 database and re-ran everything using that database. The results were the same.
Tue Aug 1 13:57:15 2023: Start StrainPhlAn 4.0.6 execution
Tue Aug 1 13:57:15 2023: Creating temporary directory…
Tue Aug 1 13:57:15 2023: Done.
Tue Aug 1 13:57:15 2023: Filtering markers and samples…
Tue Aug 1 13:57:15 2023: Getting markers from main samples…
Tue Aug 1 13:57:15 2023: Done.
Tue Aug 1 13:57:15 2023: Getting markers from main references…
Tue Aug 1 13:57:16 2023: Done.
Tue Aug 1 13:57:16 2023: Removing bad markers / samples…
Tue Aug 1 13:57:16 2023: [Error] Phylogeny can not be inferred. Too many samples were discarded.
Tue Aug 1 13:57:16 2023: Stop StrainPhlAn execution.
It is clearly not an issue with the version of the database. It can't be an issue of the strain being absent, as it is present in all of the samples. It also seems extremely unlikely that too few markers are matching: this is the tutorial, so one would expect the data to have sufficient coverage, or the tutorial to mention it if that were not the case.
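To put a number on the marker side of this, one option is to count the marker sequences that extract_markers.py wrote for the clade and compare that count with whatever filtering thresholds `strainphlan --help` lists for the installed version. The sketch below only counts FASTA headers; the function name `count_fasta_records` is made up for illustration.

```shell
#!/usr/bin/env bash
# Count FASTA records (header lines beginning with '>') in a plain or
# bzip2-compressed file. Prints 0 when the file has no headers.
count_fasta_records() {
    local fasta="$1"
    case "$fasta" in
        *.bz2) bzcat "$fasta" | grep -c '^>' || true ;;
        *)     grep -c '^>' "$fasta" || true ;;
    esac
}

# Example (path from the extract_markers.py step above):
#   count_fasta_records db_markers/t__SGB1877.fna
```

If the clade has fewer markers than a fixed minimum-markers threshold, samples could be discarded no matter how good the coverage is, which would be consistent with the error above. This is only a hypothesis to test, not a confirmed cause.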
Something is off, but I have no idea what it is. Any help resolving the problem would be appreciated.
Thank you for your time,
Erik Hendrickson
McLean lab
University of Washington