StrainPhlAn calls PhyloPhlAn as part of its workflow, and PhyloPhlAn assumes an internet connection (i.e., "/phylophlan_configs/" folder does not exists and unable to download phylophlan_databases.txt?dl=1 · Issue #33 · biobakery/phylophlan · GitHub), so it would help to provide instructions on how to run StrainPhlAn without an internet connection (e.g., on a compute cluster).
Hi @nick-youngblut,
Thanks for getting in touch.
The "phylophlan_configs/" folder does not exists error has already been reported by some users with older PhyloPhlAn versions. However, although StrainPhlAn reports it as an error, it actually behaves as a warning, and execution continues without any problem. If you want to remove the warning, you can manually create the phylophlan_configs folder inside your PhyloPhlAn installation folder.
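The manual step described above could be scripted roughly as follows. This is a minimal sketch, not part of StrainPhlAn or PhyloPhlAn: the helper names `find_phylophlan_dir` and `ensure_configs_dir` are hypothetical, and it assumes the folder belongs directly under the installed `phylophlan` package directory, as the path in the error message suggests.

```python
from pathlib import Path
import importlib.util


def find_phylophlan_dir():
    """Return the installed phylophlan package directory, or None if absent."""
    # find_spec locates the package without importing it.
    spec = importlib.util.find_spec("phylophlan")
    if spec is None or not spec.submodule_search_locations:
        return None
    return Path(next(iter(spec.submodule_search_locations)))


def ensure_configs_dir(package_dir):
    """Create <package_dir>/phylophlan_configs if missing and return its path."""
    configs = Path(package_dir) / "phylophlan_configs"
    configs.mkdir(parents=True, exist_ok=True)
    return configs


if __name__ == "__main__":
    pkg = find_phylophlan_dir()
    if pkg is None:
        print("phylophlan is not importable in this environment")
    else:
        print("created or found:", ensure_configs_dir(pkg))
```

Run it with the Python interpreter of the conda environment that contains PhyloPhlAn, so that the package lookup resolves to the right installation.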
For the other error, unable to download phylophlan_databases.txt?dl=1, I would need a little more information about your MetaPhlAn / PhyloPhlAn installation. Did you install it using conda? Which versions of metaphlan (metaphlan --version) and phylophlan (phylophlan --version) have you installed?
Thanks,
Aitor
Hi @aitor.blancomiguez! Thanks for the quick reply!
When I run strainphlan3 (metaphlan 3.0.7, bioconda; phylophlan 3.0.1, bioconda) with an internet connection, strainphlan completes successfully. However, when I run a new install of strainphlan (metaphlan 3.0.7, bioconda) with no internet connection (so no possibility for the software to download any setup files), the resulting error is:
Mon Jan 18 14:16:14 2021: Start StrainPhlAn 3.0 execution
Mon Jan 18 14:16:14 2021: Creating temporary directory...
Mon Jan 18 14:16:14 2021: Done.
Mon Jan 18 14:16:14 2021: Getting markers from main sample files...
Mon Jan 18 14:16:14 2021: Done.
Mon Jan 18 14:16:14 2021: Getting markers from main reference files...
Mon Jan 18 14:16:14 2021: Done.
Mon Jan 18 14:16:14 2021: Removing bad markers / samples...
Mon Jan 18 14:16:14 2021: Done.
Mon Jan 18 14:16:14 2021: Writing samples as markers' FASTA files...
Mon Jan 18 14:16:14 2021: Done.
Mon Jan 18 14:16:14 2021: Writing filtered clade markers as FASTA file...
Mon Jan 18 14:16:14 2021: Done.
Mon Jan 18 14:16:14 2021: Calculating polymorphic rates...
Mon Jan 18 14:16:14 2021: Done.
Mon Jan 18 14:16:14 2021: Executing PhyloPhlAn 3.0...
Mon Jan 18 14:16:14 2021: Creating PhyloPhlAn 3.0 database...
Mon Jan 18 14:16:15 2021: Done.
Mon Jan 18 14:16:15 2021: Generating PhyloPhlAn 3.0 configuration file...
Mon Jan 18 14:16:15 2021: Done.
Mon Jan 18 14:16:15 2021: Processing samples...[e] "/ebio/abt3_projects/software/dev/ll_pipelines/llmgps/.snakemake/conda/5e96ed0a/lib/python3.8/site-packages/phylophlan/phylophlan_configs/" folder does not exists
[e] Command '['/ebio/abt3_projects/software/dev/ll_pipelines/llmgps/.snakemake/conda/5e96ed0a/bin/makeblastdb', '-parse_seqids', '-dbtype', 'nucl', '-in', 'tests/output_amy_n8/Strainphlan3/s__Faecalibacterium_prausnitzii/strainphlan/tmp1dgzrhfg/s__Faecalibacterium_prausnitzii/s__Faecalibacterium_prausnitzii.fna', '-out', 'tests/output_amy_n8/Strainphlan3/s__Faecalibacterium_prausnitzii/strainphlan/tmp1dgzrhfg/s__Faecalibacterium_prausnitzii/s__Faecalibacterium_prausnitzii']' returned non-zero exit status 255.
[e] cannot execute command
command_line: /ebio/abt3_projects/software/dev/ll_pipelines/llmgps/.snakemake/conda/5e96ed0a/bin/makeblastdb -parse_seqids -dbtype nucl -in tests/output_amy_n8/Strainphlan3/s__Faecalibacterium_prausnitzii/strainphlan/tmp1dgzrhfg/s__Faecalibacterium_prausnitzii/s__Faecalibacterium_prausnitzii.fna -out tests/output_amy_n8/Strainphlan3/s__Faecalibacterium_prausnitzii/strainphlan/tmp1dgzrhfg/s__Faecalibacterium_prausnitzii/s__Faecalibacterium_prausnitzii
stdin: None
stdout: None
env: {'REQNAME': 'snakejob.sp3_strainphlan3.41.sh', 'CONDA_PROMPT_MODIFIER': '(/ebio/abt3_projects/software/dev/ll_pipelines/llmgps/.snakemake/conda/5e96ed0a) ', 'JOB_ID': '1909545', 'MAIL': '/var/mail/root', 'USER': 'nyoungblut', 'SSH_CLIENT': '172.18.3.229 60058 22', 'SGE_CWD_PATH': '/ebio/abt3_projects/software/dev/ll_pipelines/llmgps', 'LC_TIME': 'C', 'HOSTNAME': 'node509', 'BLASTDB': '/ebio/abt3_projects/databases_no-backup/NCBI_blastdb/', 'NQUEUES': '1', 'OPENBLAS_NUM_THREADS': '8', 'SGE_TASK_ID': 'undefined', 'SHLVL': '4', 'SGE_O_MAIL': '/var/mail/nyoungblut', 'CONDA_SHLVL': '2', 'OLDPWD': '/ebio/abt3_projects/software/dev/ll_pipelines/llmgps', 'HOME': '/ebio/abt3/nyoungblut', 'ARC': 'lx-amd64', 'ENVIRONMENT': 'BATCH', 'SSH_TTY': '/dev/pts/0', 'QUEUE': 'long.q', 'RESTARTED': '0', 'LC_MONETARY': 'C', 'SGE_STDIN_PATH': '/dev/null', 'LC_CTYPE': 'en_US.UTF-8', 'NHOSTS': '1', 'SGE_O_HOME': '/ebio/abt3/nyoungblut', '_CE_M': '', 'SGE_O_SHELL': '/bin/bash', 'TMPDIR': '/tmp/1909545.1.long.q', 'SGE_STDERR_PATH': '/ebio/abt3_projects/software/dev/ll_pipelines/llmgps/tests/output_amy_n8/logs/sp3/strainphlan3/s__Faecalibacterium_prausnitzii_sge.err', 'NSLOTS': '14', 'SGE_O_PATH': '/ebio/abt3_projects/software/dev/miniconda3_dev/envs/snakemake/bin:/ebio/abt3_projects/software/dev/miniconda3_dev/condabin:/ebio/abt3_projects/software/bin/singularity/bin:/ebio/abt3_projects/software/bin/go/bin:/ebio/abt3/nyoungblut/bin/iterm2:/ebio/abt3/nyoungblut/bin/direnv:/ebio/abt3/nyoungblut/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin', 'LOGNAME': 'nyoungblut', 'SGE_TASK_LAST': 'undefined', '_': '/ebio/abt3_projects/software/dev/ll_pipelines/llmgps/.snakemake/conda/5e96ed0a/bin/strainphlan', 'JOB_SCRIPT': '/var/spool/gridengine/execd/node509/job_scripts/1909545', 'TERM': '', 'UDOCKER_DIR': '/ebio/abt3_projects/software/dev/udocker/', '_CE_CONDA': '', 'LC_COLLATE': 'en_US.UTF-8', 'SGE_ROOT': '/var/lib/gridengine', 'PATH': 
'/ebio/abt3_projects/software/dev/ll_pipelines/llmgps/.snakemake/conda/5e96ed0a/bin:/ebio/abt3_projects/software/dev/miniconda3_dev/envs/snakemake/bin:/ebio/abt3_projects/software/dev/ll_pipelines/llmgps/.snakemake/conda/5e96ed0a/bin:/ebio/abt3_projects/software/miniconda3/condabin:/ebio/abt3_projects/software/bin/singularity/bin:/ebio/abt3_projects/software/bin/go/bin:/ebio/abt3/nyoungblut/bin/iterm2:/ebio/abt3/nyoungblut/bin/direnv:/ebio/abt3/nyoungblut/bin:/usr/local/bin:/usr/bin:/bin', 'SGE_ARCH': 'lx-amd64', 'HOST_LOCATION': 'Rack03', 'SGE_TASK_FIRST': 'undefined', 'LC_ADDRESS': 'C', 'GOTO_NUM_THREADS': '8', 'SGE_CELL': 'default', 'SGE_JOB_SPOOL_DIR': '/var/spool/gridengine/execd/node509/active_jobs/1909545.1', 'PE_HOSTFILE': '/var/spool/gridengine/execd/node509/active_jobs/1909545.1/pe_hostfile', 'LANG': 'en_US.UTF-8', 'CONDA_PREFIX_1': '/ebio/abt3_projects/software/miniconda3', 'LC_TELEPHONE': 'C', 'HISTSIZE': '2000', 'LS_COLORS': 'no=00:fi=00:di=01;32:ln=01:pi=04;44;33:so=01;35:bd=40;33;01:cd=40;33;01:ex=01;31:*.btm=01;32:*.tar=00;36:*.tgz=01;36:*.rpm=00;36:*.xz=01;36:*.taz=00;31:*.lzh=00;31:*.zip=01;36:*.Z=00;31:*.gz=01;36:*.bz2=01;36:*.jpg=01;35:*.gif=01;35:*.bmp=01;35:*.xbm=01;35:*.xpm=01;35:*.tif=01;35:*.png=01;35:*.mp3=01;37:*.ogg=01;37:*.m4a=01;37:*.wma=01;37:*.flac=01;37:*.opus=01;37:*.spx=01;37', 'TMP': '/tmp/1909545.1.long.q', 'PE': 'parallel', 'VECLIB_MAXIMUM_THREADS': '8', 'SGE_O_WORKDIR': '/ebio/abt3_projects/software/dev/ll_pipelines/llmgps', 'CONDA_PYTHON_EXE': '/ebio/abt3_projects/software/miniconda3/bin/python', 'SGE_ACCOUNT': 'sge', 'LC_MESSAGES': 'C', 'LC_NAME': 'C', 'SGE_TASK_STEPSIZE': 'undefined', 'SHELL': '/bin/bash', 'CONDA_DEFAULT_ENV': '/ebio/abt3_projects/software/dev/ll_pipelines/llmgps/.snakemake/conda/5e96ed0a', 'SGE_STDOUT_PATH': '/ebio/abt3_projects/software/dev/ll_pipelines/llmgps/tests/output_amy_n8/logs/sp3/strainphlan3/s__Faecalibacterium_prausnitzii_sge.out', 'LC_MEASUREMENT': 'C', 'NUMEXPR_NUM_THREADS': '8', 
'LC_IDENTIFICATION': 'C', 'SGE_O_HOST': 'rick', 'REQUEST': 'snakejob.sp3_strainphlan3.41.sh', 'JOB_NAME': 'snakejob.sp3_strainphlan3.41.sh', 'LC_ALL': 'en_US.UTF-8', 'PWD': '/ebio/abt3_projects/software/dev/ll_pipelines/llmgps', 'CONDA_EXE': '/ebio/abt3_projects/software/miniconda3/bin/conda', 'SSH_CONNECTION': '172.18.3.229 60058 172.18.3.98 22', 'MKL_NUM_THREADS': '8', 'SGE_O_LOGNAME': 'nyoungblut', 'LC_NUMERIC': 'C', 'OMP_NUM_THREADS': '8', 'LC_PAPER': 'C', 'TZ': 'MET', 'CONDA_PREFIX': '/ebio/abt3_projects/software/dev/ll_pipelines/llmgps/.snakemake/conda/5e96ed0a', 'SGE_BINARY_PATH': '/usr/sbin/lx-amd64'}
…which seems to indicate that lib/python3.8/site-packages/phylophlan/phylophlan_configs/ is not created if there is no internet connection.
I’ve also manually checked, and indeed, the phylophlan_configs directory does not exist when running strainphlan without an internet connection.
My conda env setup for that strainphlan env is simply:
channels:
- conda-forge
- bioconda
dependencies:
- pigz
- conda-forge::scikit-learn
- bioconda::seqkit
- bioconda::metaphlan>=3.0.1
Ah, I now see that it’s the /ebio/abt3_projects/software/dev/ll_pipelines/llmgps/.snakemake/conda/5e96ed0a/bin/makeblastdb child job that is causing the problems. I don’t think that it is a memory issue, but I’ll try using more.
It turns out that, for some reason, makeblastdb generates Error: mdb_env_open: Cannot allocate memory errors even when providing 300G of memory for the cluster job. The input fasta only has 38 sequences (47K file size). This memory issue for blast 2.10.1 is causing strainphlan to die.
Hi @nick-youngblut
It actually is a really strange error. Let us know if you find a solution.
Best,
Aitor
I was told by blast dev support that 2.11.0 would fix the issue, but qsub jobs with makeblastdb still throw a cannot allocate memory error, even when creating a db of just one bacterial genome and providing 20G of memory for the job.
My problem is that Strainphlan3 uses more than its designated number of threads (via --nproc {threads}) when run locally, for example by generating hundreds of tbfast jobs that seem to spawn from raxml when it is called by strainphlan. So, I’d like to run the jobs in parallel on our cluster, where the threads are limited by the job resources, but there’s the makeblastdb memory error.
So either I run up the local server to 400-500X load on an 80-thread machine, or I get a memory error when running strainphlan on our cluster due to the makeblastdb issue.
Is there any way to provide the blastdb as input to strainphlan so that it doesn’t have to be computed during the job? Either that, or to split the job into the blast portion and the phylo inference portion?
Hi @nick-youngblut
The problem you are experiencing with the tbfast jobs is really weird. tbfast should be part of the mafft execution that PhyloPhlAn calls internally, and it uses StrainPhlAn’s --nproc as the --thread parameter. Could you please tell us which PhyloPhlAn version you are using and share the PhyloPhlAn configuration file that StrainPhlAn is creating?
FYI: I think that I found the source of the issue with running makeblastdb on our cluster. From the blast docs:
Starting with the 2.10.0 release, makeblastdb produces version 5 databases by default, which uses LMDB. LMDB requires virtual memory (at least 600 GB, but 800 GB is recommended). Virtual memory is just that (virtual) and doesn’t depend on the hardware in your system. In general, we recommend that BLAST users simply set the virtual memory to unlimited.
I tested this, and it is true: I have to provide >= 600G for -l h_vmem in order to get makeblastdb to run successfully. However, that is the only cluster resource for memory management, so this greatly increases the wait in the job queue.
It might be good to move to something else other than blast or allow for blast database construction in a separate step from the rest of the strainphlan algorithm.
My conda env:
# Name Version Build Channel
_libgcc_mutex 0.1 conda_forge conda-forge
_openmp_mutex 4.5 1_gnu conda-forge
bcbio-gff 0.6.6 pyh864c0ab_1 bioconda
biom-format 2.1.10 py38h0b5ebd8_0 conda-forge
biopython 1.78 py38h497a2fe_1 conda-forge
blast 2.10.1 pl526he19e7b1_3 bioconda
boost-cpp 1.70.0 h7b93d67_3 conda-forge
bowtie2 2.4.2 py38h1c8e9b9_1 bioconda
brotlipy 0.7.0 py38h497a2fe_1001 conda-forge
bx-python 0.8.9 py38hb90e610_2 bioconda
bzip2 1.0.8 h7f98852_4 conda-forge
c-ares 1.17.1 h36c2ea0_0 conda-forge
ca-certificates 2021.1.19 h06a4308_0
cached-property 1.5.2 py_0
capnproto 0.6.1 hfc679d8_1 conda-forge
certifi 2020.12.5 py38h578d9bd_1 conda-forge
cffi 1.14.4 py38ha65f79e_1 conda-forge
chardet 4.0.0 py38h578d9bd_1 conda-forge
click 7.1.2 pyh9f0ad1d_0 conda-forge
cmseq 1.0.2 pyh7b7c402_0 bioconda
cryptography 3.3.1 py38h2b97feb_1 conda-forge
curl 7.71.1 he644dc0_8 conda-forge
cycler 0.10.0 py_2 conda-forge
dendropy 4.5.1 pyh3252c3a_0 bioconda
diamond 2.0.6 h56fc30b_0 bioconda
entrez-direct 13.9 pl526h375a9b1_0 bioconda
expat 2.2.10 he6710b0_2
fasttree 2.1.10 h516909a_4 bioconda
freetype 2.10.4 h0708190_1 conda-forge
future 0.18.2 py38h578d9bd_3 conda-forge
gsl 2.6 he838d99_2 conda-forge
h5py 3.1.0 nompi_py38hafa665b_100 conda-forge
hdf5 1.10.6 nompi_h6a2412b_1114 conda-forge
htslib 1.11 hd3b49d5_1 bioconda
icu 67.1 he1b5a44_0 conda-forge
idna 2.10 pyh9f0ad1d_0 conda-forge
iqtree 2.0.3 h176a8bc_1 bioconda
joblib 1.0.0 pyhd8ed1ab_0 conda-forge
jpeg 9d h516909a_0 conda-forge
kiwisolver 1.3.1 py38h1fd1430_1 conda-forge
krb5 1.17.2 h926e7f8_0 conda-forge
lcms2 2.11 hcbb858e_1 conda-forge
ld_impl_linux-64 2.35.1 hed1e6ac_1 conda-forge
libblas 3.9.0 7_openblas conda-forge
libcblas 3.9.0 7_openblas conda-forge
libcurl 7.71.1 hcdd3856_8 conda-forge
libdeflate 1.6 h516909a_0 conda-forge
libedit 3.1.20191231 he28a2e2_2 conda-forge
libev 4.33 h516909a_1 conda-forge
libffi 3.3 h58526e2_2 conda-forge
libgcc-ng 9.3.0 h2828fa1_18 conda-forge
libgfortran-ng 9.3.0 hff62375_18 conda-forge
libgfortran5 9.3.0 hff62375_18 conda-forge
libgomp 9.3.0 h2828fa1_18 conda-forge
libiconv 1.16 h516909a_0 conda-forge
liblapack 3.9.0 7_openblas conda-forge
libnghttp2 1.41.0 h8cfc5f6_2 conda-forge
libopenblas 0.3.12 pthreads_h4812303_1 conda-forge
libpng 1.6.37 hed695b0_2 conda-forge
libssh2 1.9.0 hab1572f_5 conda-forge
libstdcxx-ng 9.3.0 h6de172a_18 conda-forge
libtiff 4.2.0 hdc55705_0 conda-forge
libwebp-base 1.1.0 h516909a_3 conda-forge
libxml2 2.9.10 h68273f3_2 conda-forge
lz4-c 1.9.3 h9c3ff4c_0 conda-forge
lzo 2.10 h516909a_1000 conda-forge
mafft 7.475 h516909a_0 bioconda
mash 2.2.2 ha61e061_2 bioconda
matplotlib-base 3.3.3 py38h5c7f4ab_0 conda-forge
metaphlan 3.0.7 pyh7b7c402_0 bioconda
muscle 3.8.1551 hc9558a2_5 bioconda
ncurses 6.2 h58526e2_4 conda-forge
newick_utils 1.6 h516909a_3 bioconda
numpy 1.19.5 py38h18fd61f_1 conda-forge
olefile 0.46 pyh9f0ad1d_1 conda-forge
openssl 1.1.1i h7f98852_0 conda-forge
pandas 1.2.0 py38h51da96c_1 conda-forge
patsy 0.5.1 py_0 conda-forge
pcre 8.44 he1b5a44_0 conda-forge
perl 5.26.2 h36c2ea0_1008 conda-forge
perl-app-cpanminus 1.7044 pl526_1 bioconda
perl-archive-tar 2.32 pl526_0 bioconda
perl-base 2.23 pl526_1 bioconda
perl-business-isbn 3.004 pl526_0 bioconda
perl-business-isbn-data 20140910.003 pl526_0 bioconda
perl-carp 1.38 pl526_3 bioconda
perl-common-sense 3.74 pl526_2 bioconda
perl-compress-raw-bzip2 2.087 pl526he1b5a44_0 bioconda
perl-compress-raw-zlib 2.087 pl526hc9558a2_0 bioconda
perl-constant 1.33 pl526_1 bioconda
perl-data-dumper 2.173 pl526_0 bioconda
perl-digest-hmac 1.03 pl526_3 bioconda
perl-digest-md5 2.55 pl526_0 bioconda
perl-encode 2.88 pl526_1 bioconda
perl-encode-locale 1.05 pl526_6 bioconda
perl-exporter 5.72 pl526_1 bioconda
perl-exporter-tiny 1.002001 pl526_0 bioconda
perl-extutils-makemaker 7.36 pl526_1 bioconda
perl-file-listing 6.04 pl526_1 bioconda
perl-file-path 2.16 pl526_0 bioconda
perl-file-temp 0.2304 pl526_2 bioconda
perl-html-parser 3.72 pl526h6bb024c_5 bioconda
perl-html-tagset 3.20 pl526_3 bioconda
perl-html-tree 5.07 pl526_1 bioconda
perl-http-cookies 6.04 pl526_0 bioconda
perl-http-daemon 6.01 pl526_1 bioconda
perl-http-date 6.02 pl526_3 bioconda
perl-http-message 6.18 pl526_0 bioconda
perl-http-negotiate 6.01 pl526_3 bioconda
perl-io-compress 2.087 pl526he1b5a44_0 bioconda
perl-io-html 1.001 pl526_2 bioconda
perl-io-socket-ssl 2.066 pl526_0 bioconda
perl-io-zlib 1.10 pl526_2 bioconda
perl-json 4.02 pl526_0 bioconda
perl-json-xs 2.34 pl526h6bb024c_3 bioconda
perl-libwww-perl 6.39 pl526_0 bioconda
perl-list-moreutils 0.428 pl526_1 bioconda
perl-list-moreutils-xs 0.428 pl526_0 bioconda
perl-lwp-mediatypes 6.04 pl526_0 bioconda
perl-lwp-protocol-https 6.07 pl526_4 bioconda
perl-mime-base64 3.15 pl526_1 bioconda
perl-mozilla-ca 20180117 pl526_1 bioconda
perl-net-http 6.19 pl526_0 bioconda
perl-net-ssleay 1.88 pl526h90d6eec_0 bioconda
perl-ntlm 1.09 pl526_4 bioconda
perl-parent 0.236 pl526_1 bioconda
perl-pathtools 3.75 pl526h14c3975_1 bioconda
perl-scalar-list-utils 1.52 pl526h516909a_0 bioconda
perl-socket 2.027 pl526_1 bioconda
perl-storable 3.15 pl526h14c3975_0 bioconda
perl-test-requiresinternet 0.05 pl526_0 bioconda
perl-time-local 1.28 pl526_1 bioconda
perl-try-tiny 0.30 pl526_1 bioconda
perl-types-serialiser 1.0 pl526_2 bioconda
perl-uri 1.76 pl526_0 bioconda
perl-www-robotrules 6.02 pl526_3 bioconda
perl-xml-namespacesupport 1.12 pl526_0 bioconda
perl-xml-parser 2.44_01 pl526ha1d75be_1002 conda-forge
perl-xml-sax 1.02 pl526_0 bioconda
perl-xml-sax-base 1.09 pl526_0 bioconda
perl-xml-sax-expat 0.51 pl526_3 bioconda
perl-xml-simple 2.25 pl526_1 bioconda
perl-xsloader 0.24 pl526_0 bioconda
phylophlan 3.0.1 py_0 bioconda
pigz 2.4 h84994c4_0
pillow 8.1.0 py38h357d4e7_1 conda-forge
pip 20.3.3 pyhd8ed1ab_0 conda-forge
pycparser 2.20 pyh9f0ad1d_2 conda-forge
pyopenssl 20.0.1 pyhd8ed1ab_0 conda-forge
pyparsing 2.4.7 pyh9f0ad1d_0 conda-forge
pysam 0.16.0.1 py38hbdc2ae9_1 bioconda
pysocks 1.7.1 py38h578d9bd_3 conda-forge
python 3.8.6 hffdb5ce_4_cpython conda-forge
python-dateutil 2.8.1 py_0 conda-forge
python-lzo 1.12 py38h86e1cee_1003 conda-forge
python_abi 3.8 1_cp38 conda-forge
pytz 2020.5 pyhd8ed1ab_0 conda-forge
raxml 8.2.12 h516909a_2 bioconda
readline 8.0 he28a2e2_2 conda-forge
requests 2.25.1 pyhd3deb0d_0 conda-forge
samtools 1.11 h6270b1f_0 bioconda
scikit-learn 0.24.0 py38h658cfdd_0 conda-forge
scipy 1.6.0 py38hb2138dd_0 conda-forge
seaborn 0.11.1 ha770c72_0 conda-forge
seaborn-base 0.11.1 pyhd8ed1ab_0 conda-forge
seqkit 0.15.0 0 bioconda
setuptools 51.1.2 py38h06a4308_4
six 1.15.0 pyh9f0ad1d_0 conda-forge
sqlite 3.34.0 h74cdb3f_0 conda-forge
statsmodels 0.12.1 py38h5c078b8_2 conda-forge
tbb 2020.3 hfd86e86_0
threadpoolctl 2.1.0 pyh5ca1d4c_0 conda-forge
tk 8.6.10 hed695b0_1 conda-forge
tornado 6.1 py38h497a2fe_1 conda-forge
trimal 1.4.1 hc9558a2_4 bioconda
urllib3 1.26.2 pyhd8ed1ab_0 conda-forge
wheel 0.36.2 pyhd3deb0d_0 conda-forge
xz 5.2.5 h516909a_1 conda-forge
zlib 1.2.11 h516909a_1010 conda-forge
zstd 1.4.8 ha95c52a_1 conda-forge
An example phylophlan config:
[db_dna]
program_name = /ebio/abt3_projects/software/dev/ll_pipelines/llmgps/.snakemake/conda/5e96ed0a/bin/makeblastdb
params = -parse_seqids -dbtype nucl
input = -in
output = -out
version = -version
command_line = #program_name# #params# #input# #output#
[map_dna]
program_name = /ebio/abt3_projects/software/dev/ll_pipelines/llmgps/.snakemake/conda/5e96ed0a/bin/blastn
params = -outfmt 6 -evalue 0.1 -max_target_seqs 1000000 -perc_identity 75
input = -query
database = -db
output = -out
version = -version
command_line = #program_name# #params# #input# #database# #output#
[msa]
program_name = /ebio/abt3_projects/software/dev/ll_pipelines/llmgps/.snakemake/conda/5e96ed0a/bin/mafft
params = --quiet --anysymbol --thread 1 --auto
version = --version
command_line = #program_name# #params# #input# > #output#
environment = TMPDIR=/tmp
[trim]
program_name = /ebio/abt3_projects/software/dev/ll_pipelines/llmgps/.snakemake/conda/5e96ed0a/bin/trimal
params = -gappyout
input = -in
output = -out
version = --version
command_line = #program_name# #params# #input# #output#
[tree1]
program_name = /ebio/abt3_projects/software/dev/ll_pipelines/llmgps/.snakemake/conda/5e96ed0a/bin/raxmlHPC-PTHREADS-SSE3
params = -p 1989 -m GTRGAMMA -N 100 -f a -x 14341
input = -s
output_path = -w
output = -n
version = -v
command_line = #program_name# #params# #threads# #output_path# #input# #output#
threads = -T
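To see at a glance what each step of such a config would run, the file can be read with Python's standard `configparser`. This is only a sketch for inspection: it assumes the file parses as plain INI, and the `#key#` substitution below is a guess at how the template is meant to read, not PhyloPhlAn's actual implementation. The path `/path/to/env/bin/mafft` is a placeholder, and `load_config` and `expand_command` are hypothetical helper names.

```python
import configparser
import re

# Shortened copy of the [msa] section above; the program path is a placeholder.
SAMPLE = """
[msa]
program_name = /path/to/env/bin/mafft
params = --quiet --anysymbol --thread 1 --auto
version = --version
command_line = #program_name# #params# #input# > #output#
"""


def load_config(text):
    """Parse config text as INI, with interpolation disabled."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(text)
    return parser


def expand_command(section):
    """Replace #key# placeholders with same-named keys from the section.

    Placeholders without a matching key (for example #input# in [msa])
    are left untouched.
    """
    def substitute(match):
        return section.get(match.group(1), match.group(0))

    return re.sub(r"#(\w+)#", substitute, section["command_line"])


if __name__ == "__main__":
    config = load_config(SAMPLE)
    for name in config.sections():
        print(f"[{name}] {expand_command(config[name])}")
```

Reading the real file with `parser.read(path)` instead of `read_string` would list every section, which makes it easy to compare the thread-related parameters of each step.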
Hi @nick-youngblut
I see. Yes, on a cluster, requesting 600GB of virtual memory will slow down all your processing. I would say you can explore two possibilities:
- Install an older BLAST version; I’m currently using version 2.9.0 for my own analyses.
- Provide a custom phylophlan config file that uses DIAMOND instead of BLAST.
As for the mafft jobs problem, it is really weird; nothing similar has been reported to us before (and the config file looks fine). We recently updated the PhyloPhlAn conda package to version 3.0.2. Could you please check whether you get the same problem with the latest fixes?
DIAMOND doesn’t work for nucleotide-nucleotide searches, correct? So it won’t work here, right?
As for the older BLAST version, this is only a short-term solution, unless you will always support older versions of BLAST.
A recent cluster job with strainphlan didn’t complete in 12 hours, but since I had to specify 600G of vmem, the cumulative memory usage of the job was >160 Tb.
Hi @nick-youngblut
You are right, DIAMOND will only work with amino acid or translated nucleotide sequences. My fault.
For now, using an older BLAST version seems to be the best solution. I will get in contact with the PhyloPhlAn developers to better assess this version problem. We will let you know.
OK. Thanks for confirming. I didn’t think that Benjamin had added nucleotide-nucleotide searches to diamond. He’s more concerned about scaling the existing diamond algorithms.
I’ve contacted BLAST support about the memory issue to see if they are planning on changing it in the (near) future, but I haven’t heard back.