I am encountering an error when running StrainPhlAn (version 4.0.6) using the pipeline provided on GitHub (StrainPhlAn 4 · biobakery/MetaPhlAn Wiki · GitHub). The error message is "[Error] Phylogeny can not be inferred. Too many samples were discarded".
When examining the reference database file, db_markers, I noticed that there were some differences between the version provided in the link and the version that is generated when I run the pipeline. For example, the reference sequence for the SGB1877 strain is represented differently in the two versions.
The version provided in the link is “>SGB1877__FHNOMNMD_02167 UniRef90_R5UVX9;k__Bacteria|p__Bacteroidetes|c__Bacteroidia|o__Bacteroidales|f__Bacteroidaceae|g__Bacteroides|s__Bacteroides_caccae|t__SGB1877;ZeeviD_2015__PNP_Main_212__bin.37 ATGAAATATTTTAAAAGATTAATGATAACGCTATGTACAGCGTTCTACTTTTGCCTGTCC TCCTGCAATTACTTGAATGTGGATGAGTATTTTGCTGATACATTGGGATACGATTCTATT TTTTCAGAATAAAATGAATCTTCAGAAATATCTATGGGCTACGGCTGCTTTCTTCCCCGAT GAGGGCGCTATCTGGGGTGGTGCTTATACACCGGGTGTTACCGGTTCGGATGAAGCCTTT GTGCAATGGAACACGGGCGAATTTCCCGGAGTAACATTTGTTTTGGGGCACACGACTCCC”
The version generated by running the pipeline is “>UniRef90_A0A174E9V7|4__9|SGB1877 CGAAGAATCCTGTCCGGTGAAAGCTATCAGTAAAGATGAACACGGAATAGAGCATATCGA CGAAAGCAAATGTATATATTGCGGAAAGTGCATGAATGCTTGTCCGTTCGGTGCTATCTT CGAGATTTCACAGACATTCGACGTTTTGCAACGAATCCGTAAAGGAGAGAAAATGGTGGC TATTATTGCTCCGTCTATCCTCGGGCAGTTCAAGACTTCGATCGAACAAGTATATGGAGC TTTTAAAGAAATAGGATTTACCGATGTGATTGAAGTGGCCGAAGGAGCAATGTCGACTAC”
I am not sure if this difference is the cause of the error I am encountering, but I wanted to mention it in case it is relevant.
Do you have any suggestions for how I can resolve this error? I have already tried to troubleshoot the issue by examining the data and software versions, but I have not been able to identify a solution.
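For anyone wanting to check how far two db_markers files actually diverge, comparing the FASTA header IDs is a quick first step. The following is a minimal Python sketch, not part of StrainPhlAn; the function names `fasta_ids` and `compare_marker_sets` are hypothetical, and the demo sequences are truncated from the two records quoted above.

```python
def fasta_ids(lines):
    """Return the set of record IDs (first token after '>') from FASTA lines."""
    ids = set()
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            parts = line[1:].split()
            if parts:  # skip a bare '>' with no ID
                ids.add(parts[0])
    return ids


def compare_marker_sets(old_lines, new_lines):
    """Split marker IDs into shared, only-in-old and only-in-new sets."""
    old_ids, new_ids = fasta_ids(old_lines), fasta_ids(new_lines)
    return {
        "shared": old_ids & new_ids,
        "only_old": old_ids - new_ids,
        "only_new": new_ids - old_ids,
    }


if __name__ == "__main__":
    # Headers taken from the post; sequences truncated for illustration only.
    tutorial_markers = [">SGB1877__FHNOMNMD_02167", "ATGAAATATTTTAAAAGATT"]
    local_markers = [">UniRef90_A0A174E9V7|4__9|SGB1877", "CGAAGAATCCTGTCCGGTGA"]
    result = compare_marker_sets(tutorial_markers, local_markers)
    for key, ids in result.items():
        print(key, sorted(ids))
```

To run it on real files, pass open file handles (for example `open("db_markers/t__SGB1877.fna")`) instead of the lists. If the "shared" set is empty or very small, the two files most likely come from different database versions.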
The example on GitHub was executed using the vJan21 database, whereas MetaPhlAn 4 now uses version vOct22 by default. Using the parameter --index mpa_vJan21_CHOCOPhlAnSGB_202103 should produce the same results (it will also download and install the old database, but vOct22 will still be the default).
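As a rough illustration of pinning the index, here is a small Python sketch that only builds the command line and does not run it. The index name comes from the reply above; the input and output paths, the sample name, and the helper `metaphlan_command` are hypothetical, and the remaining options are assumed to mirror the mapping step of the wiki tutorial, so please check them against `metaphlan --help` for your installation.

```python
import shlex

# Index name quoted in the reply above.
JAN21_INDEX = "mpa_vJan21_CHOCOPhlAnSGB_202103"


def metaphlan_command(fastq, sample, index=JAN21_INDEX):
    """Build (but do not run) a MetaPhlAn command pinned to one database index."""
    return [
        "metaphlan", fastq,
        "--input_type", "fastq",
        "--index", index,  # explicit database version instead of the default
        "-s", f"sams/{sample}.sam.bz2",                  # hypothetical output path
        "--bowtie2out", f"bowtie2/{sample}.bowtie2.bz2",  # hypothetical output path
        "-o", f"profiles/{sample}_profiled.tsv",          # hypothetical output path
    ]


if __name__ == "__main__":
    # Print a shell-ready command for one hypothetical sample.
    print(shlex.join(metaphlan_command("fastq/sample1.fastq.bz2", "sample1")))
```

Building the command in one place makes it easier to keep the same index for every sample, rather than mixing the default database with the older one.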
In the beginning, I thought the database version caused the error, so I cleared the vJan21 database and reran the pipeline.
However, this error still occurred in step 5 when running “strainphlan -s consensus_markers/*.pkl -m db_markers/t__SGB1877.fna -r reference_genomes/G000273725.fna.bz2 -o output -n 8 -c t__SGB1877 --mutation_rates”
Then I added "--marker_in_n_samples 1 --sample_with_n_markers 1" to this command, and a different error occurred. It seems to occur during the alignment process:
“[e] Command '['~/miniconda3/envs/Metaphlan/bin/mafft', '--quiet', '--anysymbol', '--thread', '1', '--auto', 'output/tmpil09t85f/markers/848025373357.fna']' returned non-zero exit status 1.”
“[e] msas crashed”
So, is the database causing this error, and how can I solve it?
Reducing --marker_in_n_samples or --sample_with_n_markers to 1 (meaning 1%) will leave some of the markers empty, and the multiple sequence alignment by mafft will then crash. But did you also run sample2markers specifying the Jan21 version of the database, or only strainphlan?
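If the intermediate marker files can still be inspected, a quick check for empty ones may confirm this before mafft is involved. The following is a minimal Python sketch under assumptions: the helper names are hypothetical, the `*.fna` pattern is taken from the file name in the error message, and "empty" is assumed to mean a file with no record containing any residues, which may not match exactly what StrainPhlAn produces.

```python
from pathlib import Path
import tempfile


def has_sequence(fasta_path):
    """Return True if the FASTA file has at least one record with residues."""
    in_record = False
    with open(fasta_path) as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                in_record = True
            elif in_record and line.replace("-", ""):
                # A non-blank sequence line that is not only gap characters.
                return True
    return False


def empty_marker_files(marker_dir, pattern="*.fna"):
    """List the names of marker FASTA files that contain no usable sequence."""
    return sorted(
        path.name
        for path in Path(marker_dir).glob(pattern)
        if not has_sequence(path)
    )


if __name__ == "__main__":
    # Small self-contained demo with made-up files.
    with tempfile.TemporaryDirectory() as tmp:
        Path(tmp, "ok.fna").write_text(">sample1\nACGT\n")
        Path(tmp, "empty.fna").write_text("")
        print(empty_marker_files(tmp))
```

Any file reported here would be a likely candidate for the alignment failure, which suggests raising the thresholds again rather than lowering them to 1.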
I just used StrainPhlAn (4.0.6) to test the pipeline. The demo data runs smoothly with the Jan21 database, but when I use the Oct22 database, the pipeline gives me this error message. I don't know whether the database version caused this error. If you have some free time to run the demo data with the Oct22 database, it might help me rule out whether the database version is the problem.
Good luck to you.
The tutorial read files were filtered, to speed things up, so that they contain only reads mapping against the Jan21 markers. As the markers might change between versions, running them against another database can produce slightly different results or even not work.