Error related to step 3; reference genome

Hi,
I am trying to do StrainPhlAn analysis on my samples. Here is the input I use:

strainphlan -s consensus_markers/*.pkl -m clade_markers/*.fna -r reference_genome/*.gbff -o output -c g__Chlamydia --phylophlan_mode fast

But I get the following error:
Wed Nov 18 17:42:55 2020: Start StrainPhlAn 3.0 execution
Wed Nov 18 17:42:55 2020: Creating temporary directory…
Wed Nov 18 17:42:55 2020: Done.
Wed Nov 18 17:42:55 2020: Getting markers from main sample files…
Wed Nov 18 17:42:55 2020: Done.
Wed Nov 18 17:42:55 2020: Getting markers from main reference files…BLAST options error: reference_genome/C.muridarum_GCA_000767405.1_ASM76740v1_genomic.gbff does not match input format type, default input type is FASTA

[e] An error was ocurred executing a external tool, exiting…
Wed Nov 18 17:42:56 2020: Stop StrainPhlAn 3.0 execution.
BLAST options error: reference_genome/C.suis_GCA_900149625.2_5-27b_chromosome_genomic.gbff does not match input format type, default input type is FASTA

[e] An error was ocurred executing a external tool, exiting…
Wed Nov 18 17:42:56 2020: Stop StrainPhlAn 3.0 execution.

So I downloaded a reference genome from NCBI in FASTA (.fna) format and reran the analysis:

strainphlan -s consensus_markers/*.pkl -m clade_markers/*.fna -r reference_genome/*.fna -o output -c g__Chlamydia --phylophlan_mode fast

But this time I get the following error:

The main inputs samples + references are less than 4
Wed Nov 18 17:49:09 2020: Stop StrainPhlAn 3.0 execution

Could you please help me solve this issue? What is the best way to provide the reference genome?

Hi @MaryamTarazkar
How many samples do you have in the consensus_markers folder, and how many genomes (with the .fna extension) are in the reference_genome folder? This error appears because StrainPhlAn needs at least 4 samples to perform the analysis (these can be a mix of reference genomes and metagenomic samples, or only metagenomic samples).
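
To check this before launching a run, the inputs can be counted with a few lines of Python. This is only a minimal sketch: the folder names and extensions are the ones used in this thread, and the helper names (count_inputs, enough_inputs) are made up for illustration and are not part of StrainPhlAn.

```python
from glob import glob

# Minimum number of inputs (samples + references) mentioned in this thread.
MIN_INPUTS = 4


def count_inputs(sample_pattern, reference_pattern):
    """Return (n_samples, n_references) for two shell-style glob patterns."""
    n_samples = len(glob(sample_pattern))
    n_references = len(glob(reference_pattern))
    return n_samples, n_references


def enough_inputs(n_samples, n_references, minimum=MIN_INPUTS):
    """True if samples and references together reach the minimum."""
    return n_samples + n_references >= minimum


if __name__ == "__main__":
    # Paths as used in this thread; adjust them to your own layout.
    n_s, n_r = count_inputs("consensus_markers/*.pkl", "reference_genome/*.fna")
    print(f"samples: {n_s}, references: {n_r}, enough: {enough_inputs(n_s, n_r)}")
```

With two samples and one reference the total is 3, which is below the minimum and matches the error message above.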

Hi,
Thank you for getting back to me. Yes, you are right: I had two samples in the consensus_markers folder and one reference genome file ending in .fna. I added another reference genome file and I no longer get this error. However, I now get a different error, shown below:
Thu Nov 19 17:26:44 2020: Start StrainPhlAn 3.0 execution
Thu Nov 19 17:26:44 2020: Creating temporary directory…
Thu Nov 19 17:26:44 2020: Done.
Thu Nov 19 17:26:44 2020: Getting markers from main sample files…
Thu Nov 19 17:26:44 2020: Done.
Thu Nov 19 17:26:44 2020: Getting markers from main reference files…Warning: [blastn] Examining 5 or more matches is recommended
Warning: [blastn] Examining 5 or more matches is recommended

Thu Nov 19 17:26:44 2020: Done.
Thu Nov 19 17:26:44 2020: Removing bad markers / samples…
[e] Phylogeny can not be inferred. Too many samples were discarded
Thu Nov 19 17:26:44 2020: Stop StrainPhlAn 3.0 execution.

I would appreciate your help resolving this issue.
Thank you
Maryam

Hi @MaryamTarazkar
Since StrainPhlAn's phylogenetic analysis is based on the MetaPhlAn species-specific markers, the analysis can only be performed at the species level. You are running it at the genus level (-c g__Chlamydia), and therefore StrainPhlAn cannot find any markers in the samples.

Best,
Aitor

Thank you for getting back to me. I also tried at the species level, but I get the same error.
Here is my command:
strainphlan -s consensus_markers/*.pkl -m clade_markers/*.fna -r reference_genome/*.fasta -o output -c s__Chlamydia_trachomatis --phylophlan_mode fast

and I get the following error:
Fri Nov 20 15:43:36 2020: Start StrainPhlAn 3.0 execution
Fri Nov 20 15:43:36 2020: Creating temporary directory…
Fri Nov 20 15:43:36 2020: Done.
Fri Nov 20 15:43:36 2020: Getting markers from main sample files…
Fri Nov 20 15:43:37 2020: Done.
Fri Nov 20 15:43:37 2020: Getting markers from main reference files…Warning: [blastn] Examining 5 or more matches is recommended
Warning: [blastn] Examining 5 or more matches is recommended

Fri Nov 20 15:43:37 2020: Done.
Fri Nov 20 15:43:37 2020: Removing bad markers / samples…
[e] Phylogeny can not be inferred. Too many samples were discarded
Fri Nov 20 15:43:37 2020: Stop StrainPhlAn 3.0 execution.

Any recommendations would be greatly appreciated.
Best
Maryam

Hi @MaryamTarazkar
To improve the quality of the phylogenetic reconstruction, StrainPhlAn performs an initial screening and filtering step. By default, StrainPhlAn discards samples with fewer than 20 markers (for the selected species, in your case C. trachomatis), and it also discards markers present in fewer than 80% of the samples. These thresholds can be modified using the parameters --sample_with_n_markers and --marker_in_n_samples, respectively. Even if you have enough samples, if your samples / reconstructed markers don’t pass these thresholds you will get this kind of error during execution.
To find out which species you can reconstruct with a specific set of thresholds, you can run StrainPhlAn with the parameter --print_clades_only.
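
To see how the two thresholds interact, here is a rough Python sketch of that kind of filtering. It is an illustration, not StrainPhlAn's actual code: the function name, the data layout (a dict mapping each sample to the set of markers reconstructed for it) and the order of the two steps are assumptions, and only the default values (20 markers, 80% of samples) come from the explanation above.

```python
def filter_samples_and_markers(sample_markers, min_markers=20, min_sample_fraction=0.8):
    """Toy version of the screening step (hypothetical helper, not StrainPhlAn code).

    sample_markers: dict mapping sample name -> iterable of marker names.
    Returns (kept_samples, kept_markers).
    """
    # Step 1 (assumed order): drop samples with fewer than min_markers markers.
    kept_samples = {
        sample: set(markers)
        for sample, markers in sample_markers.items()
        if len(set(markers)) >= min_markers
    }
    if not kept_samples:
        # Nothing left: this is the situation where no phylogeny can be built.
        return {}, set()

    # Step 2: count in how many of the remaining samples each marker occurs.
    counts = {}
    for markers in kept_samples.values():
        for marker in markers:
            counts[marker] = counts.get(marker, 0) + 1

    # Keep markers present in at least min_sample_fraction of remaining samples.
    n_samples = len(kept_samples)
    kept_markers = {
        marker for marker, count in counts.items()
        if count / n_samples >= min_sample_fraction
    }
    return kept_samples, kept_markers


if __name__ == "__main__":
    # Made-up toy data with tiny thresholds so the effect is visible.
    toy = {
        "A": {"m1", "m2", "m3"},
        "B": {"m1", "m2"},
        "C": {"m1"},
        "D": {"m1", "m2", "m3"},
    }
    samples, markers = filter_samples_and_markers(toy, min_markers=2)
    print(sorted(samples))   # ['A', 'B', 'D']  (C has only one marker)
    print(sorted(markers))   # ['m1', 'm2']     (m3 is in 2 of 3 samples, below 80%)
    # With the default thresholds every toy sample is discarded.
    print(filter_samples_and_markers(toy))  # ({}, set())
```

In the toy data, lowering the sample threshold keeps three of the four samples. With the defaults nothing survives, which is the same situation that produces the "Too many samples were discarded" message.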

I hope this helps.
Best,
Aitor