Having issues with Strainphlan

I downloaded metaphlan2 using conda and I am attempting to use strainphlan to profile my samples at the strain level.

I have been following the instructions on this page:
https://bitbucket.org/biobakery/biobakery/wiki/strainphlan

I successfully completed the sample2markers.py step, however I cannot run step 3: Identify clades detected in the samples and build reference databases. The command i am using is:

strainphlan.py --ifn_samples p100_bowtie2_aligned.markers --output_dir markers/ --print_clades_only > clades.txt

Which results in the following error:

Traceback (most recent call last):
File “/mnt/nfs/home/30041036/.conda/envs/metaphlan2/bin/strainphlan.py”, line 1585, in
strainphlan()
File “/mnt/nfs/home/30041036/.conda/envs/metaphlan2/bin/strainphlan.py”, line 1581, in strainphlan
strainer(args)
File “/mnt/nfs/home/30041036/.conda/envs/metaphlan2/bin/strainphlan.py”, line 1365, in strainer
db = pickle.load(bz2.BZ2File(args[‘mpa_pkl’]))
File “/mnt/nfs/home/30041036/.conda/envs/metaphlan2/lib/python3.7/bz2.py”, line 92, in init
self._fp = _builtin_open(filename, mode)
IsADirectoryError: [Errno 21] Is a directory: ‘/mnt/nfs/home/30041036/.conda/envs/metaphlan2/bin/metaphlan_databases’

How can I fix this? I can’t see that anyone else has had similar issues. Also in step 4 --ifn_markers s__Eubacterium_siraeum.markers.fasta is used in the command, how do I generate this fasta file for the species I am interested in (e.g. staphylococcus aureus)?

Hi mradz19,
Could you tell me the version of the MetaPhlAn2 database you used for the profiling?

For the question about the step 4, you should use the script extract_markers.py
You can take a look on this example (step 4): https://bitbucket.org/biobakery/metaphlan2/src/default/README.md#markdown-header-usage
For this script, remember to specify the correct metaphlan2 database version.

Best,
Aitor

Hi Aitor,

This is the version:

MetaPhlAn version 2.96.1 (02 Feb 2020)

Hi mradz19,
Try to add the param –index v296_CHOCOPhlAn_201901 to the strainphlan execution like:
strainphlan.py --ifn_samples p100_bowtie2_aligned.markers --output_dir markers/ --print_clades_only --index v296_CHOCOPhlAn_201901 > clades.txt

Best,
Aitor

1 Like

Hi Aitor,

I tried adding that parameter and got the same error message.

Hi Michael,
Could you check then the content of this folder: /mnt/nfs/home/30041036/.conda/envs/metaphlan2/bin/metaphlan_databases

Hi Aitor,

These are the contents of that folder:

mpa_latest
mpa_v295_CHOCOPhlAn_201901.1.bt2
mpa_v295_CHOCOPhlAn_201901.2.bt2
mpa_v295_CHOCOPhlAn_201901.3.bt2
mpa_v295_CHOCOPhlAn_201901.4.bt2
mpa_v295_CHOCOPhlAn_201901.fna.bz2
mpa_v295_CHOCOPhlAn_201901.md
mpa_v295_CHOCOPhlAn_201901.pkl
mpa_v295_CHOCOPhlAn_201901.rev.1.bt2
mpa_v295_CHOCOPhlAn_201901.rev.2.bt2
mpa_v295_CHOCOPhlAn_201901.tar

I chaged the --index tag to v295_CHOCOPhlan_201901 and the command ran, however the output .txt file is empty.

This is the log of the run:

strainphlan.py --ifn_samples p100_bowtie2_aligned.markers --output_dir markers/ --print_clades_only --index v295_CHOCOPhlAn_201901 > clades2.txt

2020-02-24 10:19:18,264 | INFO | main | strainer | 1364 | Load mpa_pkl

2020-02-24 10:19:30,288 | INFO | main | strainer | 1380 | Get clades from db

2020-02-24 10:19:31,878 | INFO | main | strainer | 1425 | Get clades from samples

2020-02-24 10:19:31,878 | DEBUG | main | load_sample | 1150 | load p100_bowtie2_aligned.markers

Hello
I have the same problem, nothing output when I run the last step strainphlan.py

#MetaPhlAn version 2.96.1 (02 Feb 2020)

# run metaphlan2 on the demo sample input files
>metaphlan2.py $INPUT_FOLDER/13530241_SF05.fasta.gz $OUTPUT_FOLDER/13530241_SF05_profile.txt --bowtie2out $OUTPUT_FOLDER/13530241_SF05_bowtie2.txt --samout $OUTPUT_FOLDER/13530241_SF05.sam.bz2 --input_type multifasta --index mpa_v296_CHOCOPhlAn_201901 --nproc $THREADS

Elapsed time to run MetaPhlAn2: 61.940003871917725 s


# run sample to markers on all of the samples
>sample2markers.py --ifn_samples $OUTPUT_FOLDER/13530241_SF05.sam.bz2 --input_type sam --output_dir $OUTPUT_FOLDER --nprocs $THREADS

/software/StrainPhlAn2/biobakery-biobakery-414eab928577/demos/biobakery_demos/data/strainphlan/output/13530241_SF05.sam.bz2 | samtools view -bS - | samtools sort - -o /software/StrainPhlAn2/biobakery-biobakery-414eab928577/demos/biobakery_demos/data/strainphlan/output/13530241_SF05.sam.bz2.bam.sorted | samtools mpileup -u - | bcftools view -c -g -p 1.1 - | fix_AF1.py --input_file - | vcfutils.pl vcf2fq



# run metaphlan2 strainer on all samples (add the flag to reduce the default as these are subsampled)
strainphlan.py --index v296_CHOCOPhlAn_201901 --ifn_samples $OUTPUT_FOLDER/*.markers --ifn_markers $INPUT_FOLDER/s__Eubacterium_siraeum.markers.fasta --ifn_ref_genomes $INPUT_FOLDER/GCF_000154325.fna.bz2 --output_dir $OUTPUT_FOLDER --nprocs_main $THREADS --clades s__Eubacterium_siraeum --marker_in_clade 0.2 --keep_alignment_files


2020-03-14 11:40:15,504 | INFO | __main__ | strainer | 1364 | Load mpa_pkl
2020-03-14 11:40:25,664 | INFO | __main__ | strainer | 1380 | Get clades from db
2020-03-14 11:40:28,017 | INFO | __main__ | strainer | 1444 | Add reference genomes
2020-03-14 11:40:28,032 | DEBUG | __main__ | add_ref_genomes | 617 | add 1 reference genomes
...
2020-03-14 11:40:28,495 | DEBUG | __main__ | filter_sequence | 475 | sample GCF_000154325, number of markers after N_in_marker: 150
sample GCF_000154325, number of markers after marker_strip_length: 150
2020-03-14 11:40:28,495 | DEBUG | __main__ | strainer | 1503 | remove samples with percentage of markers less than marker_in_clade
2020-03-14 11:40:28,495 | DEBUG | __main__ | build_tree | 850 | skip clade s__Eubacterium_siraeum because number of present samples is 1
2020-03-14 11:40:28,495 | INFO | __main__ | strainer | 1550 | Finished!




Hi Michael,
When StrainPhlAn is not able to return any clade could be due two main reasons:

  1. The database version you used for create the SAM file is different than the version you used for executing StrainPhlAn. This can be checked taking a look on the first line of the abundances report file generated together with the SAM file.
  2. The sample2markers script was not able to reconstruct enough markers for your sample.

If you could share your markers file I could take a deeper look on the problem.

Best,
Aitor

Hi CK_zhu,
As you can see in the lines returned by StrainPhlAn:

2020-03-14 11:40:28,495 | DEBUG | main | build_tree | 850 | skip clade s__Eubacterium_siraeum because number of present samples is 1

StrainPhlAn only detected the clade in one of your files, so the strain-level analysis is imposible to execute.

Best,
Aitor

Hi Aitor,
Where I can download the STRAINPHLAN_DB_REFERENCE and STRAINPHLAN_DB_MAKERS directly?Thank you very much!

Best regards

Li Zhihua

Hi @lzh1982
The StrainPhlAn markers’ database is same as the MetaPhlAn markers’ database.
If you installed MetaPhlAn 3 via conda, StrainPhlAn and the markers’ database will be also downloaded and installed, please check the tutorial for more info: https://github.com/biobakery/MetaPhlAn/wiki/MetaPhlAn-3.0#installation
If you have any issue with the conda installatioiin, you can also download the database from the following links:

Best,
Aitor

Hi Aitor,
Thank you very much for your explanation! I have installed biobakery_workflow through docker. I need know how to download the reference database not only maker database and install STRAINPHLAN_DB_REFERENCE manually? Would you help me?

Best regards

Li Zhihua

Hi,
You should set the the environmental variables STRAINPHLAN_DB_REFERENCE and STRAINPHLAN_DB_MAKERS inside your Docker instance using export.

STRAINPHLAN_DB_REFERENCE should point to the folder containing the reference genomes used when running StrainPhlAn and STRAINPHLAN_DB_MARKERS points to the folder containing the StrainPhlAn marker files.

Dear Dr.Francesco.Beghini,
Thank you very much for your explanation! I do not know where I can download the STRAINPHLAN_DB_REFERENCE database. Because I can not install directly using the order:biobakery_workflows --install wmgx, so I want to download the corresponding datatbase and install manually! Many thanks!

Best regards

Li Zhihua

You should retrieve the genomes of the species of interest from any genomic repository (e.g. Refseq)

1 Like