Strainphlan without an internet connection

Strainphlan calls phylophlan as part of its workflow, and phylophlan assumes an internet connection (see GitHub issue #33 in biobakery/phylophlan: '"/phylophlan_configs/" folder does not exists and unable to download phylophlan_databases.txt?dl=1'), so it would help to provide instructions on how to run strainphlan without an internet connection (e.g., on a compute cluster).

Hi @nick-youngblut,
Thanks for getting in touch.
The '"phylophlan_configs/" folder does not exists' error has already been reported by some users with older PhyloPhlAn versions. However, even though StrainPhlAn reports it as an error, it actually behaves as a warning, and the execution will continue without any problem. If you want to remove the warning, you can manually create the phylophlan_configs folder inside your PhyloPhlAn installation folder.
For the other error, 'unable to download phylophlan_databases.txt?dl=1', I would need a bit more information about your MetaPhlAn / PhyloPhlAn installation. Did you install it using conda? Which versions of MetaPhlAn (metaphlan --version) and PhyloPhlAn (phylophlan --version) have you installed?

Thanks,
Aitor
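
For anyone wanting to script the manual workaround described in the reply above, here is a minimal Python sketch. It assumes that PhyloPhlAn is installed as an importable package named phylophlan in the active environment (the path in the error log later in this thread suggests site-packages/phylophlan/). The helper names ensure_configs_dir and find_package_dir are hypothetical and are not part of PhyloPhlAn.

```python
import importlib.util
from pathlib import Path
from typing import Optional


def ensure_configs_dir(package_dir: Path) -> Path:
    """Create <package_dir>/phylophlan_configs if it is missing and return it."""
    configs_dir = package_dir / "phylophlan_configs"
    configs_dir.mkdir(parents=True, exist_ok=True)
    return configs_dir


def find_package_dir(package_name: str = "phylophlan") -> Optional[Path]:
    """Return the install directory of an importable package, or None."""
    spec = importlib.util.find_spec(package_name)
    if spec is None or not spec.submodule_search_locations:
        return None
    return Path(next(iter(spec.submodule_search_locations)))


if __name__ == "__main__":
    pkg_dir = find_package_dir()
    if pkg_dir is None:
        print("phylophlan is not importable from this Python environment")
    else:
        print(f"configs folder ready: {ensure_configs_dir(pkg_dir)}")
```

Running this once with the environment's own Python, on a node that can write to the environment, should be enough; it only creates the empty folder and does not download anything.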

Hi @aitor.blancomiguez! Thanks for the quick reply!

When I run strainphlan3 (metaphlan 3.0.7, bioconda; phylophlan 3.0.1, bioconda) with an internet connection, strainphlan completes successfully. However, when I run a new install of strainphlan (metaphlan 3.0.7, bioconda) with no internet connection (so no possibility for the software to download any setup files), the resulting error is:

Mon Jan 18 14:16:14 2021: Start StrainPhlAn 3.0 execution
Mon Jan 18 14:16:14 2021: Creating temporary directory...
Mon Jan 18 14:16:14 2021: Done.
Mon Jan 18 14:16:14 2021: Getting markers from main sample files...
Mon Jan 18 14:16:14 2021: Done.
Mon Jan 18 14:16:14 2021: Getting markers from main reference files...
Mon Jan 18 14:16:14 2021: Done.
Mon Jan 18 14:16:14 2021: Removing bad markers / samples...
Mon Jan 18 14:16:14 2021: Done.
Mon Jan 18 14:16:14 2021: Writing samples as markers' FASTA files...
Mon Jan 18 14:16:14 2021: Done.
Mon Jan 18 14:16:14 2021: Writing filtered clade markers as FASTA file...
Mon Jan 18 14:16:14 2021: Done.
Mon Jan 18 14:16:14 2021: Calculating polymorphic rates...
Mon Jan 18 14:16:14 2021: Done.
Mon Jan 18 14:16:14 2021: Executing PhyloPhlAn 3.0...
Mon Jan 18 14:16:14 2021: 	Creating PhyloPhlAn 3.0 database...
Mon Jan 18 14:16:15 2021: 	Done.
Mon Jan 18 14:16:15 2021: 	Generating PhyloPhlAn 3.0 configuration file...
Mon Jan 18 14:16:15 2021: 	Done.
Mon Jan 18 14:16:15 2021: 	Processing samples...[e] "/ebio/abt3_projects/software/dev/ll_pipelines/llmgps/.snakemake/conda/5e96ed0a/lib/python3.8/site-packages/phylophlan/phylophlan_configs/" folder does not exists

[e] Command '['/ebio/abt3_projects/software/dev/ll_pipelines/llmgps/.snakemake/conda/5e96ed0a/bin/makeblastdb', '-parse_seqids', '-dbtype', 'nucl', '-in', 'tests/output_amy_n8/Strainphlan3/s__Faecalibacterium_prausnitzii/strainphlan/tmp1dgzrhfg/s__Faecalibacterium_prausnitzii/s__Faecalibacterium_prausnitzii.fna', '-out', 'tests/output_amy_n8/Strainphlan3/s__Faecalibacterium_prausnitzii/strainphlan/tmp1dgzrhfg/s__Faecalibacterium_prausnitzii/s__Faecalibacterium_prausnitzii']' returned non-zero exit status 255.

[e] cannot execute command
    command_line: /ebio/abt3_projects/software/dev/ll_pipelines/llmgps/.snakemake/conda/5e96ed0a/bin/makeblastdb -parse_seqids -dbtype nucl -in tests/output_amy_n8/Strainphlan3/s__Faecalibacterium_prausnitzii/strainphlan/tmp1dgzrhfg/s__Faecalibacterium_prausnitzii/s__Faecalibacterium_prausnitzii.fna -out tests/output_amy_n8/Strainphlan3/s__Faecalibacterium_prausnitzii/strainphlan/tmp1dgzrhfg/s__Faecalibacterium_prausnitzii/s__Faecalibacterium_prausnitzii
           stdin: None
          stdout: None
             env: {'REQNAME': 'snakejob.sp3_strainphlan3.41.sh', 'CONDA_PROMPT_MODIFIER': '(/ebio/abt3_projects/software/dev/ll_pipelines/llmgps/.snakemake/conda/5e96ed0a) ', 'JOB_ID': '1909545', 'MAIL': '/var/mail/root', 'USER': 'nyoungblut', 'SSH_CLIENT': '172.18.3.229 60058 22', 'SGE_CWD_PATH': '/ebio/abt3_projects/software/dev/ll_pipelines/llmgps', 'LC_TIME': 'C', 'HOSTNAME': 'node509', 'BLASTDB': '/ebio/abt3_projects/databases_no-backup/NCBI_blastdb/', 'NQUEUES': '1', 'OPENBLAS_NUM_THREADS': '8', 'SGE_TASK_ID': 'undefined', 'SHLVL': '4', 'SGE_O_MAIL': '/var/mail/nyoungblut', 'CONDA_SHLVL': '2', 'OLDPWD': '/ebio/abt3_projects/software/dev/ll_pipelines/llmgps', 'HOME': '/ebio/abt3/nyoungblut', 'ARC': 'lx-amd64', 'ENVIRONMENT': 'BATCH', 'SSH_TTY': '/dev/pts/0', 'QUEUE': 'long.q', 'RESTARTED': '0', 'LC_MONETARY': 'C', 'SGE_STDIN_PATH': '/dev/null', 'LC_CTYPE': 'en_US.UTF-8', 'NHOSTS': '1', 'SGE_O_HOME': '/ebio/abt3/nyoungblut', '_CE_M': '', 'SGE_O_SHELL': '/bin/bash', 'TMPDIR': '/tmp/1909545.1.long.q', 'SGE_STDERR_PATH': '/ebio/abt3_projects/software/dev/ll_pipelines/llmgps/tests/output_amy_n8/logs/sp3/strainphlan3/s__Faecalibacterium_prausnitzii_sge.err', 'NSLOTS': '14', 'SGE_O_PATH': '/ebio/abt3_projects/software/dev/miniconda3_dev/envs/snakemake/bin:/ebio/abt3_projects/software/dev/miniconda3_dev/condabin:/ebio/abt3_projects/software/bin/singularity/bin:/ebio/abt3_projects/software/bin/go/bin:/ebio/abt3/nyoungblut/bin/iterm2:/ebio/abt3/nyoungblut/bin/direnv:/ebio/abt3/nyoungblut/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin', 'LOGNAME': 'nyoungblut', 'SGE_TASK_LAST': 'undefined', '_': '/ebio/abt3_projects/software/dev/ll_pipelines/llmgps/.snakemake/conda/5e96ed0a/bin/strainphlan', 'JOB_SCRIPT': '/var/spool/gridengine/execd/node509/job_scripts/1909545', 'TERM': '', 'UDOCKER_DIR': '/ebio/abt3_projects/software/dev/udocker/', '_CE_CONDA': '', 'LC_COLLATE': 'en_US.UTF-8', 'SGE_ROOT': '/var/lib/gridengine', 
'PATH': '/ebio/abt3_projects/software/dev/ll_pipelines/llmgps/.snakemake/conda/5e96ed0a/bin:/ebio/abt3_projects/software/dev/miniconda3_dev/envs/snakemake/bin:/ebio/abt3_projects/software/dev/ll_pipelines/llmgps/.snakemake/conda/5e96ed0a/bin:/ebio/abt3_projects/software/miniconda3/condabin:/ebio/abt3_projects/software/bin/singularity/bin:/ebio/abt3_projects/software/bin/go/bin:/ebio/abt3/nyoungblut/bin/iterm2:/ebio/abt3/nyoungblut/bin/direnv:/ebio/abt3/nyoungblut/bin:/usr/local/bin:/usr/bin:/bin', 'SGE_ARCH': 'lx-amd64', 'HOST_LOCATION': 'Rack03', 'SGE_TASK_FIRST': 'undefined', 'LC_ADDRESS': 'C', 'GOTO_NUM_THREADS': '8', 'SGE_CELL': 'default', 'SGE_JOB_SPOOL_DIR': '/var/spool/gridengine/execd/node509/active_jobs/1909545.1', 'PE_HOSTFILE': '/var/spool/gridengine/execd/node509/active_jobs/1909545.1/pe_hostfile', 'LANG': 'en_US.UTF-8', 'CONDA_PREFIX_1': '/ebio/abt3_projects/software/miniconda3', 'LC_TELEPHONE': 'C', 'HISTSIZE': '2000', 'LS_COLORS': 'no=00:fi=00:di=01;32:ln=01:pi=04;44;33:so=01;35:bd=40;33;01:cd=40;33;01:ex=01;31:*.btm=01;32:*.tar=00;36:*.tgz=01;36:*.rpm=00;36:*.xz=01;36:*.taz=00;31:*.lzh=00;31:*.zip=01;36:*.Z=00;31:*.gz=01;36:*.bz2=01;36:*.jpg=01;35:*.gif=01;35:*.bmp=01;35:*.xbm=01;35:*.xpm=01;35:*.tif=01;35:*.png=01;35:*.mp3=01;37:*.ogg=01;37:*.m4a=01;37:*.wma=01;37:*.flac=01;37:*.opus=01;37:*.spx=01;37', 'TMP': '/tmp/1909545.1.long.q', 'PE': 'parallel', 'VECLIB_MAXIMUM_THREADS': '8', 'SGE_O_WORKDIR': '/ebio/abt3_projects/software/dev/ll_pipelines/llmgps', 'CONDA_PYTHON_EXE': '/ebio/abt3_projects/software/miniconda3/bin/python', 'SGE_ACCOUNT': 'sge', 'LC_MESSAGES': 'C', 'LC_NAME': 'C', 'SGE_TASK_STEPSIZE': 'undefined', 'SHELL': '/bin/bash', 'CONDA_DEFAULT_ENV': '/ebio/abt3_projects/software/dev/ll_pipelines/llmgps/.snakemake/conda/5e96ed0a', 'SGE_STDOUT_PATH': '/ebio/abt3_projects/software/dev/ll_pipelines/llmgps/tests/output_amy_n8/logs/sp3/strainphlan3/s__Faecalibacterium_prausnitzii_sge.out', 'LC_MEASUREMENT': 'C', 'NUMEXPR_NUM_THREADS': '8', 
'LC_IDENTIFICATION': 'C', 'SGE_O_HOST': 'rick', 'REQUEST': 'snakejob.sp3_strainphlan3.41.sh', 'JOB_NAME': 'snakejob.sp3_strainphlan3.41.sh', 'LC_ALL': 'en_US.UTF-8', 'PWD': '/ebio/abt3_projects/software/dev/ll_pipelines/llmgps', 'CONDA_EXE': '/ebio/abt3_projects/software/miniconda3/bin/conda', 'SSH_CONNECTION': '172.18.3.229 60058 172.18.3.98 22', 'MKL_NUM_THREADS': '8', 'SGE_O_LOGNAME': 'nyoungblut', 'LC_NUMERIC': 'C', 'OMP_NUM_THREADS': '8', 'LC_PAPER': 'C', 'TZ': 'MET', 'CONDA_PREFIX': '/ebio/abt3_projects/software/dev/ll_pipelines/llmgps/.snakemake/conda/5e96ed0a', 'SGE_BINARY_PATH': '/usr/sbin/lx-amd64'}

…which seems to indicate that lib/python3.8/site-packages/phylophlan/phylophlan_configs/ is not created if there is no internet connection.

I’ve also manually checked, and indeed, the phylophlan_configs directory does not exist when running strainphlan without an internet connection.

My conda env setup for that strainphlan env is simply:

channels:
- conda-forge
- bioconda
dependencies:
- pigz
- conda-forge::scikit-learn
- bioconda::seqkit
- bioconda::metaphlan>=3.0.1

Ah, I now see that it’s the /ebio/abt3_projects/software/dev/ll_pipelines/llmgps/.snakemake/conda/5e96ed0a/bin/makeblastdb child job that is causing the problems. I don’t think that it is a memory issue, but I’ll try using more.

It turns out that, for some reason, makeblastdb generates "Error: mdb_env_open: Cannot allocate memory" errors even when providing 300G of memory for the cluster job. The input fasta only has 38 sequences (47K file size). This memory issue with blast 2.10.1 is causing strainphlan to die.

Hi @nick-youngblut
It actually is a really strange error. Let us know if you find a solution.

Best,
Aitor

I was told by blast dev support that 2.11.0 would fix the issue, but qsub jobs with makeblastdb still throw a cannot allocate memory error, even when creating a db of just one bacterial genome and providing 20G of memory for the job.

My problem is that Strainphlan3 uses more than its designated number of threads (via --nproc {threads}) when run locally; for example, it generates hundreds of tbfast jobs that seem to spawn from raxml when it is called by strainphlan. So, I’d like to run the jobs in parallel on our cluster, where the threads are limited by the job resources, but there’s the makeblastdb memory error.

So either I drive the local server up to a 400-500X load on an 80-thread machine, or I get a memory error when running strainphlan on our cluster due to the makeblastdb issue.

Is there any way to provide the blastdb as input to strainphlan so that it doesn’t have to be computed during the job? Either that, or to split the job into the blast portion and the phylo inference portion?

Hi @nick-youngblut
The problem you are experiencing with the tbfast jobs is really weird. tbfast should be part of the mafft execution that PhyloPhlAn calls internally, and it uses StrainPhlAn’s --nproc as the --thread parameter. Could you please tell us the PhyloPhlAn version you are using and share the PhyloPhlAn configuration file that StrainPhlAn is creating?

FYI: I think that I found the source of the issue with running makeblastdb on our cluster. From the blast docs:

Starting with the 2.10.0 release, makeblastdb produces version 5 databases by default, which uses LMDB. LMDB requires virtual memory (at least 600 GB, but 800 GB is recommended). Virtual memory is just that (virtual) and doesn’t depend on the hardware in your system. In general, we recommend that BLAST users simply set the virtual memory to unlimited.

I tested this, and it is true: I have to provide >= 600G for -l h_vmem in order to get makeblastdb to run successfully. However, that is the only cluster resource for memory management, so this greatly increases the wait in the job queue.
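
A quick way to see what limit a job actually gets is to print the address-space limit from inside the job. This is only a sketch: it assumes the scheduler enforces h_vmem as the process address-space limit (RLIMIT_AS), which may not hold on every cluster, and it treats the 600 GB figure from the BLAST docs quoted above as 600 * 1000^3 bytes. The function name vmem_limit_ok is hypothetical.

```python
import resource

GB = 1000 ** 3
# Figure quoted from the BLAST docs above: at least 600 GB of virtual memory.
LMDB_MIN_BYTES = 600 * GB


def vmem_limit_ok(soft_limit: int, required: int = LMDB_MIN_BYTES) -> bool:
    """True if the soft address-space limit is unlimited or at least `required`."""
    return soft_limit == resource.RLIM_INFINITY or soft_limit >= required


if __name__ == "__main__":
    # Soft and hard limits on the virtual address space of this process.
    soft, _hard = resource.getrlimit(resource.RLIMIT_AS)
    shown = "unlimited" if soft == resource.RLIM_INFINITY else f"{soft / GB:.0f} GB"
    print(f"virtual memory limit: {shown}")
    print("enough for a version 5 makeblastdb run:", vmem_limit_ok(soft))
```

Submitting this as a tiny job with the same -l h_vmem request as the strainphlan job shows whether the limit seen by makeblastdb is the one you expect.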

It might be good to move to something other than blast, or to allow blast database construction in a separate step from the rest of the strainphlan algorithm.

My conda env:

# Name                    Version                   Build  Channel
_libgcc_mutex             0.1                 conda_forge    conda-forge
_openmp_mutex             4.5                       1_gnu    conda-forge
bcbio-gff                 0.6.6              pyh864c0ab_1    bioconda
biom-format               2.1.10           py38h0b5ebd8_0    conda-forge
biopython                 1.78             py38h497a2fe_1    conda-forge
blast                     2.10.1          pl526he19e7b1_3    bioconda
boost-cpp                 1.70.0               h7b93d67_3    conda-forge
bowtie2                   2.4.2            py38h1c8e9b9_1    bioconda
brotlipy                  0.7.0           py38h497a2fe_1001    conda-forge
bx-python                 0.8.9            py38hb90e610_2    bioconda
bzip2                     1.0.8                h7f98852_4    conda-forge
c-ares                    1.17.1               h36c2ea0_0    conda-forge
ca-certificates           2021.1.19            h06a4308_0
cached-property           1.5.2                      py_0
capnproto                 0.6.1                hfc679d8_1    conda-forge
certifi                   2020.12.5        py38h578d9bd_1    conda-forge
cffi                      1.14.4           py38ha65f79e_1    conda-forge
chardet                   4.0.0            py38h578d9bd_1    conda-forge
click                     7.1.2              pyh9f0ad1d_0    conda-forge
cmseq                     1.0.2              pyh7b7c402_0    bioconda
cryptography              3.3.1            py38h2b97feb_1    conda-forge
curl                      7.71.1               he644dc0_8    conda-forge
cycler                    0.10.0                     py_2    conda-forge
dendropy                  4.5.1              pyh3252c3a_0    bioconda
diamond                   2.0.6                h56fc30b_0    bioconda
entrez-direct             13.9            pl526h375a9b1_0    bioconda
expat                     2.2.10               he6710b0_2
fasttree                  2.1.10               h516909a_4    bioconda
freetype                  2.10.4               h0708190_1    conda-forge
future                    0.18.2           py38h578d9bd_3    conda-forge
gsl                       2.6                  he838d99_2    conda-forge
h5py                      3.1.0           nompi_py38hafa665b_100    conda-forge
hdf5                      1.10.6          nompi_h6a2412b_1114    conda-forge
htslib                    1.11                 hd3b49d5_1    bioconda
icu                       67.1                 he1b5a44_0    conda-forge
idna                      2.10               pyh9f0ad1d_0    conda-forge
iqtree                    2.0.3                h176a8bc_1    bioconda
joblib                    1.0.0              pyhd8ed1ab_0    conda-forge
jpeg                      9d                   h516909a_0    conda-forge
kiwisolver                1.3.1            py38h1fd1430_1    conda-forge
krb5                      1.17.2               h926e7f8_0    conda-forge
lcms2                     2.11                 hcbb858e_1    conda-forge
ld_impl_linux-64          2.35.1               hed1e6ac_1    conda-forge
libblas                   3.9.0                7_openblas    conda-forge
libcblas                  3.9.0                7_openblas    conda-forge
libcurl                   7.71.1               hcdd3856_8    conda-forge
libdeflate                1.6                  h516909a_0    conda-forge
libedit                   3.1.20191231         he28a2e2_2    conda-forge
libev                     4.33                 h516909a_1    conda-forge
libffi                    3.3                  h58526e2_2    conda-forge
libgcc-ng                 9.3.0               h2828fa1_18    conda-forge
libgfortran-ng            9.3.0               hff62375_18    conda-forge
libgfortran5              9.3.0               hff62375_18    conda-forge
libgomp                   9.3.0               h2828fa1_18    conda-forge
libiconv                  1.16                 h516909a_0    conda-forge
liblapack                 3.9.0                7_openblas    conda-forge
libnghttp2                1.41.0               h8cfc5f6_2    conda-forge
libopenblas               0.3.12          pthreads_h4812303_1    conda-forge
libpng                    1.6.37               hed695b0_2    conda-forge
libssh2                   1.9.0                hab1572f_5    conda-forge
libstdcxx-ng              9.3.0               h6de172a_18    conda-forge
libtiff                   4.2.0                hdc55705_0    conda-forge
libwebp-base              1.1.0                h516909a_3    conda-forge
libxml2                   2.9.10               h68273f3_2    conda-forge
lz4-c                     1.9.3                h9c3ff4c_0    conda-forge
lzo                       2.10              h516909a_1000    conda-forge
mafft                     7.475                h516909a_0    bioconda
mash                      2.2.2                ha61e061_2    bioconda
matplotlib-base           3.3.3            py38h5c7f4ab_0    conda-forge
metaphlan                 3.0.7              pyh7b7c402_0    bioconda
muscle                    3.8.1551             hc9558a2_5    bioconda
ncurses                   6.2                  h58526e2_4    conda-forge
newick_utils              1.6                  h516909a_3    bioconda
numpy                     1.19.5           py38h18fd61f_1    conda-forge
olefile                   0.46               pyh9f0ad1d_1    conda-forge
openssl                   1.1.1i               h7f98852_0    conda-forge
pandas                    1.2.0            py38h51da96c_1    conda-forge
patsy                     0.5.1                      py_0    conda-forge
pcre                      8.44                 he1b5a44_0    conda-forge
perl                      5.26.2            h36c2ea0_1008    conda-forge
perl-app-cpanminus        1.7044                  pl526_1    bioconda
perl-archive-tar          2.32                    pl526_0    bioconda
perl-base                 2.23                    pl526_1    bioconda
perl-business-isbn        3.004                   pl526_0    bioconda
perl-business-isbn-data   20140910.003            pl526_0    bioconda
perl-carp                 1.38                    pl526_3    bioconda
perl-common-sense         3.74                    pl526_2    bioconda
perl-compress-raw-bzip2   2.087           pl526he1b5a44_0    bioconda
perl-compress-raw-zlib    2.087           pl526hc9558a2_0    bioconda
perl-constant             1.33                    pl526_1    bioconda
perl-data-dumper          2.173                   pl526_0    bioconda
perl-digest-hmac          1.03                    pl526_3    bioconda
perl-digest-md5           2.55                    pl526_0    bioconda
perl-encode               2.88                    pl526_1    bioconda
perl-encode-locale        1.05                    pl526_6    bioconda
perl-exporter             5.72                    pl526_1    bioconda
perl-exporter-tiny        1.002001                pl526_0    bioconda
perl-extutils-makemaker   7.36                    pl526_1    bioconda
perl-file-listing         6.04                    pl526_1    bioconda
perl-file-path            2.16                    pl526_0    bioconda
perl-file-temp            0.2304                  pl526_2    bioconda
perl-html-parser          3.72            pl526h6bb024c_5    bioconda
perl-html-tagset          3.20                    pl526_3    bioconda
perl-html-tree            5.07                    pl526_1    bioconda
perl-http-cookies         6.04                    pl526_0    bioconda
perl-http-daemon          6.01                    pl526_1    bioconda
perl-http-date            6.02                    pl526_3    bioconda
perl-http-message         6.18                    pl526_0    bioconda
perl-http-negotiate       6.01                    pl526_3    bioconda
perl-io-compress          2.087           pl526he1b5a44_0    bioconda
perl-io-html              1.001                   pl526_2    bioconda
perl-io-socket-ssl        2.066                   pl526_0    bioconda
perl-io-zlib              1.10                    pl526_2    bioconda
perl-json                 4.02                    pl526_0    bioconda
perl-json-xs              2.34            pl526h6bb024c_3    bioconda
perl-libwww-perl          6.39                    pl526_0    bioconda
perl-list-moreutils       0.428                   pl526_1    bioconda
perl-list-moreutils-xs    0.428                   pl526_0    bioconda
perl-lwp-mediatypes       6.04                    pl526_0    bioconda
perl-lwp-protocol-https   6.07                    pl526_4    bioconda
perl-mime-base64          3.15                    pl526_1    bioconda
perl-mozilla-ca           20180117                pl526_1    bioconda
perl-net-http             6.19                    pl526_0    bioconda
perl-net-ssleay           1.88            pl526h90d6eec_0    bioconda
perl-ntlm                 1.09                    pl526_4    bioconda
perl-parent               0.236                   pl526_1    bioconda
perl-pathtools            3.75            pl526h14c3975_1    bioconda
perl-scalar-list-utils    1.52            pl526h516909a_0    bioconda
perl-socket               2.027                   pl526_1    bioconda
perl-storable             3.15            pl526h14c3975_0    bioconda
perl-test-requiresinternet 0.05                    pl526_0    bioconda
perl-time-local           1.28                    pl526_1    bioconda
perl-try-tiny             0.30                    pl526_1    bioconda
perl-types-serialiser     1.0                     pl526_2    bioconda
perl-uri                  1.76                    pl526_0    bioconda
perl-www-robotrules       6.02                    pl526_3    bioconda
perl-xml-namespacesupport 1.12                    pl526_0    bioconda
perl-xml-parser           2.44_01         pl526ha1d75be_1002    conda-forge
perl-xml-sax              1.02                    pl526_0    bioconda
perl-xml-sax-base         1.09                    pl526_0    bioconda
perl-xml-sax-expat        0.51                    pl526_3    bioconda
perl-xml-simple           2.25                    pl526_1    bioconda
perl-xsloader             0.24                    pl526_0    bioconda
phylophlan                3.0.1                      py_0    bioconda
pigz                      2.4                  h84994c4_0
pillow                    8.1.0            py38h357d4e7_1    conda-forge
pip                       20.3.3             pyhd8ed1ab_0    conda-forge
pycparser                 2.20               pyh9f0ad1d_2    conda-forge
pyopenssl                 20.0.1             pyhd8ed1ab_0    conda-forge
pyparsing                 2.4.7              pyh9f0ad1d_0    conda-forge
pysam                     0.16.0.1         py38hbdc2ae9_1    bioconda
pysocks                   1.7.1            py38h578d9bd_3    conda-forge
python                    3.8.6           hffdb5ce_4_cpython    conda-forge
python-dateutil           2.8.1                      py_0    conda-forge
python-lzo                1.12            py38h86e1cee_1003    conda-forge
python_abi                3.8                      1_cp38    conda-forge
pytz                      2020.5             pyhd8ed1ab_0    conda-forge
raxml                     8.2.12               h516909a_2    bioconda
readline                  8.0                  he28a2e2_2    conda-forge
requests                  2.25.1             pyhd3deb0d_0    conda-forge
samtools                  1.11                 h6270b1f_0    bioconda
scikit-learn              0.24.0           py38h658cfdd_0    conda-forge
scipy                     1.6.0            py38hb2138dd_0    conda-forge
seaborn                   0.11.1               ha770c72_0    conda-forge
seaborn-base              0.11.1             pyhd8ed1ab_0    conda-forge
seqkit                    0.15.0                        0    bioconda
setuptools                51.1.2           py38h06a4308_4
six                       1.15.0             pyh9f0ad1d_0    conda-forge
sqlite                    3.34.0               h74cdb3f_0    conda-forge
statsmodels               0.12.1           py38h5c078b8_2    conda-forge
tbb                       2020.3               hfd86e86_0
threadpoolctl             2.1.0              pyh5ca1d4c_0    conda-forge
tk                        8.6.10               hed695b0_1    conda-forge
tornado                   6.1              py38h497a2fe_1    conda-forge
trimal                    1.4.1                hc9558a2_4    bioconda
urllib3                   1.26.2             pyhd8ed1ab_0    conda-forge
wheel                     0.36.2             pyhd3deb0d_0    conda-forge
xz                        5.2.5                h516909a_1    conda-forge
zlib                      1.2.11            h516909a_1010    conda-forge
zstd                      1.4.8                ha95c52a_1    conda-forge

An example phylophlan config:

[db_dna]
program_name = /ebio/abt3_projects/software/dev/ll_pipelines/llmgps/.snakemake/conda/5e96ed0a/bin/makeblastdb
params = -parse_seqids -dbtype nucl
input = -in
output = -out
version = -version
command_line = #program_name# #params# #input# #output#

[map_dna]
program_name = /ebio/abt3_projects/software/dev/ll_pipelines/llmgps/.snakemake/conda/5e96ed0a/bin/blastn
params = -outfmt 6 -evalue 0.1 -max_target_seqs 1000000 -perc_identity 75
input = -query
database = -db
output = -out
version = -version
command_line = #program_name# #params# #input# #database# #output#

[msa]
program_name = /ebio/abt3_projects/software/dev/ll_pipelines/llmgps/.snakemake/conda/5e96ed0a/bin/mafft
params = --quiet --anysymbol --thread 1 --auto
version = --version
command_line = #program_name# #params# #input# > #output#
environment = TMPDIR=/tmp

[trim]
program_name = /ebio/abt3_projects/software/dev/ll_pipelines/llmgps/.snakemake/conda/5e96ed0a/bin/trimal
params = -gappyout
input = -in
output = -out
version = --version
command_line = #program_name# #params# #input# #output#

[tree1]
program_name = /ebio/abt3_projects/software/dev/ll_pipelines/llmgps/.snakemake/conda/5e96ed0a/bin/raxmlHPC-PTHREADS-SSE3
params = -p 1989 -m GTRGAMMA -N 100 -f a -x 14341
input = -s
output_path = -w
output = -n
version = -v
command_line = #program_name# #params# #threads# #output_path# #input# #output#
threads = -T
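
To make the link between this config and the failing command in the log explicit, here is a rough Python sketch of how the #name# placeholders in command_line appear to expand. It is inferred only from the config and the logged makeblastdb command above, not from PhyloPhlAn's source; expand_command is a hypothetical helper, program_name is shortened to the bare executable name, and the file names are made up.

```python
import configparser
import re

# The [db_dna] section from the config above, with program_name shortened.
CONFIG_TEXT = """
[db_dna]
program_name = makeblastdb
params = -parse_seqids -dbtype nucl
input = -in
output = -out
version = -version
command_line = #program_name# #params# #input# #output#
"""


def expand_command(section, values):
    """Fill the #name# placeholders of a section's command_line template.

    `values` maps placeholder names (e.g. "input") to the paths used for a
    run; each path is prefixed with the option the section declares for it.
    """
    def fill(match):
        key = match.group(1)
        option = section.get(key, "")
        if key in values:
            return f"{option} {values[key]}".strip()
        return option

    return re.sub(r"#(\w+)#", fill, section["command_line"])


# interpolation=None keeps configparser from touching the raw values.
parser = configparser.ConfigParser(interpolation=None)
parser.read_string(CONFIG_TEXT)
command = expand_command(
    parser["db_dna"], {"input": "markers.fna", "output": "markers"}
)
print(command)
# prints: makeblastdb -parse_seqids -dbtype nucl -in markers.fna -out markers
```

The printed command has the same shape as the command_line reported in the error log, which is why a change to params in [db_dna] would show up directly in the makeblastdb call.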

Hi @nick-youngblut
I see, yes, for a cluster architecture 600GB of virtual memory will slow down all your processing. I would say you can explore two possibilities:

  1. Install an older BLAST version; I’m currently using version 2.9.0 for my own analyses.
  2. Provide a custom phylophlan config file that uses DIAMOND instead of BLAST.

The mafft jobs problem is really weird; nobody has reported anything similar to us before (and the config file looks fine). We recently updated the PhyloPhlAn conda package to version 3.0.2. Could you please check whether you get the same problem with the latest fixes?
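
A possible variant of the custom-config idea that stays with BLAST, untested in this thread: makeblastdb accepts a -blastdb_version option, so the [db_dna] params could request a version 4 database, which does not use LMDB. Whether that removes the virtual memory requirement on a given cluster, and whether the rest of the workflow is happy with a version 4 database, are assumptions that would need checking. The sketch below only rewrites the config text; use_blastdb_v4 is a hypothetical helper name.

```python
import configparser
import io


def use_blastdb_v4(config_text: str) -> str:
    """Return config text whose [db_dna] params request a version 4 database."""
    # interpolation=None keeps the #placeholder# values untouched.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(config_text)
    params = parser["db_dna"]["params"]
    if "-blastdb_version" not in params:
        parser["db_dna"]["params"] = f"{params} -blastdb_version 4"
    out = io.StringIO()
    parser.write(out)
    return out.getvalue()


if __name__ == "__main__":
    sample = (
        "[db_dna]\n"
        "program_name = makeblastdb\n"
        "params = -parse_seqids -dbtype nucl\n"
        "input = -in\n"
        "output = -out\n"
        "command_line = #program_name# #params# #input# #output#\n"
    )
    print(use_blastdb_v4(sample))
```

The function leaves every other section and key as it was, so it could be applied to a copy of the config that StrainPhlAn generates before supplying that copy as the custom config.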

DIAMOND doesn’t work for nucleotide-nucleotide searches, correct? So the DIAMOND option won’t work here, right?

With regard to the older BLAST version, this is only a short-term solution, unless you plan to always support older versions of BLAST.

A recent cluster job with strainphlan didn’t complete in 12 hours, but since I had to specify 600G of vmem, the cumulative memory usage of the job was >160 TB.

Hi @nick-youngblut
You are right, DIAMOND will only work with amino acid or translated nucleotide sequences. My mistake.
Right now, using an older BLAST version seems to be the best solution. I will get in contact with the PhyloPhlAn developers to better assess this version problem. We will let you know.
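
Since the workaround depends on which BLAST version ends up in the environment, a small check like the following Python sketch can flag an environment whose makeblastdb will build version 5 (LMDB) databases by default. It relies on the "-version" flag listed in the config above and assumes the output contains a version string such as 2.10.1; the function names are hypothetical.

```python
import re
import subprocess
from typing import Optional, Tuple


def parse_blast_version(text: str) -> Optional[Tuple[int, ...]]:
    """Extract the first X.Y.Z version number from `makeblastdb -version` output."""
    match = re.search(r"(\d+)\.(\d+)\.(\d+)", text)
    return tuple(int(part) for part in match.groups()) if match else None


def defaults_to_v5(version: Tuple[int, ...]) -> bool:
    """Per the BLAST docs quoted above, 2.10.0 and later default to version 5."""
    return version >= (2, 10, 0)


if __name__ == "__main__":
    try:
        result = subprocess.run(
            ["makeblastdb", "-version"],
            capture_output=True, text=True, check=False,
        )
    except FileNotFoundError:
        print("makeblastdb not found on PATH")
    else:
        version = parse_blast_version(result.stdout)
        if version is None:
            print("could not parse a version from:", result.stdout.strip())
        else:
            dotted = ".".join(map(str, version))
            print(f"BLAST {dotted}; version 5 databases by default: {defaults_to_v5(version)}")
```

Running it inside the strainphlan conda environment before submitting jobs gives an early warning if a newer BLAST was pulled in as a dependency.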

OK. Thanks for confirming. I didn’t think that Benjamin had added nucleotide-nucleotide searches to diamond. He’s more concerned about scaling the existing diamond algorithms.

I’ve contacted BLAST support about the memory issue to see if they are planning on changing it in the (near) future, but I haven’t heard back.