MetaWIBELE preprocess failure

We installed MetaWIBELE by pulling the image from Docker, and found that the "annotation" step (which runs Prokka) frequently failed.
I think there might be something wrong with the Prokka version, but we cannot change anything inside your Docker image. We also found that Prokka's --metagenome parameter causes some errors. We don't know how to deal with these kinds of issues; please let us know how to run without those fatal errors.

By the way, we also tried to install your software from conda, but it failed (conda install -c biobakery metawibele python=2.7.1). Could there be something wrong due to environment conflicts?

Thank you!


Hi there,

Re 1: It looks like an issue arising from the license requirement of tbl2asn, which is used by Prokka. tbl2asn expires every 6-12 months and needs to be re-downloaded from NCBI (https://ftp.ncbi.nih.gov/toolbox/ncbi_tools/converters/by_program/tbl2asn/) and reinstalled in the Docker image.
For example, you can replace the older version by:

$ wget -O tbl2asn.gz ftp://ftp.ncbi.nih.gov/toolbox/ncbi_tools/converters/by_program/tbl2asn/linux64.tbl2asn.gz
$ gunzip tbl2asn.gz
$ chmod +x tbl2asn
$ cp tbl2asn /usr/local/bin/
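
If you prefer to apply this fix inside an already-running container rather than rebuilding the image, a minimal sketch (the container name metawibele_container is just a placeholder):

$ docker exec -it metawibele_container bash   # open a shell inside the running container
# inside the container, run the wget / gunzip / chmod / cp commands above,
# then confirm the new binary is the one found on the PATH:
$ which tbl2asn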

Re 2: MetaWIBELE is built on Python 3 and has been tested with Python 3.6 and 3.7. Please use Python 3.6+ to install the conda packages.
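
For example, a minimal sketch using a fresh conda environment with a supported Python version (the environment name is arbitrary):

$ conda create -n metawibele python=3.7
$ conda activate metawibele
$ conda install -c biobakery metawibele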

We recommend using the latest version of MetaWIBELE (v0.4.2). It includes a couple of updates (documented in the manual and tutorial at GitHub - biobakery/metawibele: MetaWIBELE: Workflow to Identify novel Bioactive Elements in microbiome) and has already been released on pip and conda. The new Docker image will come out soon as well. Thanks for your interest in MetaWIBELE.

Best,
Yancong


Hello,
Thanks for your detailed reply; I will try to reinstall your tools in the coming days.

Looking forward to the new Docker image as well!

Sorry to bother you again.
I recently tried the preprocess part of MetaWIBELE. Prokka now works smoothly, but I ran into different failures in tasks 41 and 43.
Do you have any idea about these messages?
PS. I have checked that the required files are present in the corresponding paths, so it might be a memory issue or something else?
Thank you in advance.

(Nov 17 13:35:21) [30/48 - 62.50%] **Failed** Task 41: combine_gene_sequences
Task 41 failed
Name: combine_gene_sequences
Original error:
Error executing action 0. Original Exception:
Traceback (most recent call last):
File "/usr/local/lib/python3.6/dist-packages/anadama2/runners.py", line 201, in _run_task_locally
action_func(task)
File "/usr/local/lib/python3.6/dist-packages/anadama2/helpers.py", line 89, in actually_sh
ret = _sh(s, **kwargs)
File "/usr/local/lib/python3.6/dist-packages/anadama2/util/__init__.py", line 320, in sh
raise ShellException(proc.returncode, msg.format(cmd, ret[0], ret[1]))
anadama2.util.ShellException: [Errno 1] Command `metawibele_combine_gene_sequences -p /home/batch_9/metawibele-out/gene_annotation/ -e ffn -o /home/batch_9/metawibele-out/b9_combined_gene.fna > /home/batch_9/metawibele-out/b9_combined_gene.log 2>&1` failed.
Out: b''
Err: b''

(Nov 17 13:35:21) [32/48 - 66.67%] **Ready** Task 43: format_protein_sequences
Task 43 failed
Name: format_protein_sequences
Original error:
Error executing action 0. Original Exception:
Traceback (most recent call last):
File "/usr/local/lib/python3.6/dist-packages/anadama2/runners.py", line 201, in _run_task_locally
action_func(task)
File "/usr/local/lib/python3.6/dist-packages/anadama2/helpers.py", line 89, in actually_sh
ret = _sh(s, **kwargs)
File "/usr/local/lib/python3.6/dist-packages/anadama2/util/__init__.py", line 320, in sh
raise ShellException(proc.returncode, msg.format(cmd, ret[0], ret[1]))
anadama2.util.ShellException: [Errno 1] Command `metawibele_format_protein_sequences -p /home/batch_9/metawibele-out/gene_annotation/ -q /home/batch_9/metawibele-out/gene_calls/ -e faa -o /home/batch_9/metawibele-out/b9_combined_protein.faa -m /home/batch_9/metawibele-out/b9_gene_info.tsv >/home/batch_9/metawibele-out/b9_combined_protein.log 2>&1` failed.
Out: b''
Err: b''

In addition to the main log file recording the running status of all tasks, MetaWIBELE also reports the detailed processing of each task in an individual log file, which can help with further debugging when needed. So for task 41, you could check its specific log file (/home/batch_9/metawibele-out/b9_combined_gene.log) and see whether there are more messages there. Similarly, for task 43, check /home/batch_9/metawibele-out/b9_combined_protein.log.
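
If those log files turn out to be empty (as the empty Out/Err fields above suggest they might be), one option is to rerun the failing command by hand without the log redirection, so any error prints directly to the terminal. For example, for task 41 (the command is copied from the error message above):

$ metawibele_combine_gene_sequences -p /home/batch_9/metawibele-out/gene_annotation/ -e ffn -o /home/batch_9/metawibele-out/b9_combined_gene.fna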

To check whether these failures were caused by limited memory, one quick and easy way is to test MetaWIBELE with a small dataset (e.g. the demo data in the tutorial: metawibele · biobakery/biobakery Wiki · GitHub) and see whether this small set runs successfully on your system. Feel free to let me know if you have any further questions and I am happy to help.
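
If you suspect an out-of-memory kill specifically, two standard Linux checks (not specific to MetaWIBELE) are:

$ dmesg | grep -i 'killed process'   # kernel OOM-killer messages, if any
$ free -h                            # memory and swap currently available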

Best,
Yancong

Dear YancongZhang,
I have checked the file b9_combined_protein.log, and the error messages are as follows:

Start format_protein_sequences.py -p /home/batch_9/metawibele-out/gene_annotation/

Get sequence info …starting
Traceback (most recent call last):
File "/usr/local/bin/metawibele_format_protein_sequences", line 11, in <module>
sys.exit(main())
File "/usr/local/lib/python3.6/dist-packages/metawibele/tools/format_protein_sequences.py", line 296, in main
gff, types, partial = collect_sequence (values.p, values.e, values.q, values.o)
File "/usr/local/lib/python3.6/dist-packages/metawibele/tools/format_protein_sequences.py", line 100, in collect_sequence
myid = myid.group(1)
AttributeError: 'NoneType' object has no attribute 'group'

Thank you in advance!

Thanks for sharing the detailed log info! It looks like you are using a somewhat old version of MetaWIBELE (I am guessing v0.3.7?). We have fixed several issues since then; I think the error you pointed out should be fixed in newer versions. Could you try our latest release, v0.4.2, instead? It is available via pip and conda now.
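
For example (a sketch; use pip or conda depending on how MetaWIBELE was installed in your environment):

$ pip install --upgrade metawibele
# or with conda:
$ conda install -c biobakery metawibele=0.4.2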

Best,
Yancong

Dear YancongZhang,

I installed the required environment from your Docker image, and I will try to update the version via pip and conda.

Thank you :blush:


Dear YancongZhang,
I recently tried to run metawibele (preprocess.py v0.4.4) in the Docker environment.
The attached figure shows that the .gff file is not in the expected location…
The actual location appears to be /usr/data/batch_16/metawibele_out/gene_calls/B15AMD0386/B15AMD0386.gff.
Do you have any idea about this error message?

Another question: could you guide me on how to install the PSORTb dependency? I find it very hard to install properly.

Thank you in advance.

Hi there,

Thanks for using MetaWIBELE v0.4.4 as released in Docker (Docker Hub). If you run the whole preprocessing workflow from start to finish, there should be a soft link to B15AMD0386.gff created in the folder /usr/data/batch_16/metawibele_out/gene_calls/. If you don't find this soft link, here are a couple of potential reasons:

  1. If you ran the preprocessing workflow from scratch and it failed at this step, some earlier steps before "format_protein_sequences" may have failed. In this case, check the log files to see whether any inputs or outputs are missing or out of order;
  2. If you resumed the preprocessing workflow from a previous unfinished run, make sure the earlier steps before "format_protein_sequences" finished successfully and that the dependent inputs were generated, e.g. check whether the corresponding soft links are in /usr/data/batch_16/metawibele_out/gene_calls/. If not, you could either make these soft links manually (see the example after this list), or rerun the workflow from the beginning to generate them.
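
For example, a manual soft link for the file mentioned above could be made like this (paths copied from your message; adjust to your actual layout):

$ ln -s /usr/data/batch_16/metawibele_out/gene_calls/B15AMD0386/B15AMD0386.gff \
    /usr/data/batch_16/metawibele_out/gene_calls/B15AMD0386.gff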

For the installation of PSORTb, according to the latest guidance on their website (PSORTb Downloads Page), they currently only encourage using the Docker version. The command-line version seems complex to install: you need to install the prerequisites yourself when compiling it locally, following their installation guide, and it takes some effort. Hopefully the new version will be easier to install when it comes out. For now, we treat the PSORTb command-line version as an optional dependency in MetaWIBELE. That means you can definitely run MetaWIBELE without PSORTb by setting the "--bypass-psortb" parameter, e.g. "metawibele characterize --input-sequence $INPUT_SEQUENCE --input-count $INPUT_COUNT --input-metadata $INPUT_METADATA --output $OUTPUT_DIR --bypass-psortb". This will still generate the most important and essential outputs from MetaWIBELE.

Thanks!
Yancong

Hi Yancong,
Thanks for your prompt response !
I tried the second way you mentioned and made the soft links manually, and the whole preprocessing workflow ran smoothly.

As for PSORTb, I will skip this step and generate the essential outputs from your software.

Another question: do the input files need adapter trimming and human-genome filtering before running your preprocessing pipeline?

Thanks ~~
Yuzie

Hi Yuzie,

Sounds great. Thanks for updating me on this progress.

For the question on preparing input fastq files, the short answer is yes. MetaWIBELE's preprocessing workflow expects quality-controlled reads as inputs, so you should run quality control on the raw sequencing reads before running the preprocessing pipeline. Many tools can do this. For example, the KneadData pipeline (KneadData – The Huttenhower Lab) can be used for filtering low-quality reads, removing repetitive sequences, and controlling contamination from the human genome and rRNA.
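
For example, a typical paired-end run might look roughly like this (an illustrative sketch only: the database path is a placeholder, and the exact option names can differ between KneadData versions, so please check the KneadData manual):

$ kneaddata --input sample_R1.fastq.gz --input sample_R2.fastq.gz \
    --reference-db /path/to/human_genome_db \
    --output kneaddata_output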

Best,
Yancong

Hello Yancong,
May I ask a simple question?
In the characterize module, the outputs can be divided into two groups: one supervised and one unsupervised. Is the unsupervised outcome related to the metadata? In other words, if I only want to see the unsupervised gene and taxonomy composition, is that possible?

Sorry for bugging you again,
Yuzie

Hi Yuzie,

Happy to help! The unsupervised module works without requiring metadata (e.g. host phenotypes). This approach assumes that common genes are likely to be functional, regardless of their association with environmental or phenotypic parameters. If you keep the "phenotype" parameter at its default setting ("phenotype = none") in MetaWIBELE's global configuration file (e.g. metawibele.cfg), then only the unsupervised module will be run. For more details about MetaWIBELE's parameter settings, feel free to check the manual: GitHub - biobakery/metawibele: MetaWIBELE: Workflow to Identify novel Bioactive Elements in microbiome
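
For example, the relevant setting in metawibele.cfg would look like this (only the single option is shown; the surrounding sections of the configuration file are omitted here):

# leave the phenotype setting at its default so that only the unsupervised module is run
phenotype = none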

Thanks!
Yancong

Hi there,
When I used the characterization module, I encountered an error message.

03/21/2022 07:22:48 AM - metawibele.config - INFO: ### Start uniref_protein step ####
03/21/2022 07:22:48 AM - metawibele.config - INFO: Get UniRef DB and annotation info …starting
Traceback (most recent call last):
File "/usr/local/bin/metawibele_uniref_protein", line 11, in <module>
sys.exit(main())
File "/usr/local/lib/python3.6/dist-packages/metawibele/characterize/uniref_protein.py", line 323, in main
pfam = collect_pfam_info (config.pfam_database)
File "/usr/local/lib/python3.6/dist-packages/metawibele/characterize/uniref_protein.py", line 92, in collect_pfam_info
for line in utils.gzip_bzip2_biom_open_readlines (pfamfile):
File "/usr/local/lib/python3.6/dist-packages/metawibele/common/utils.py", line 329, in gzip_bzip2_biom_open_readlines
for line in file_handle:
File "/usr/lib/python3.6/encodings/ascii.py", line 26, in decode
return codecs.ascii_decode(input, self.errors)[0]
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 5159: ordinal not in range(128)

Do you have any idea how to solve this error?

Thanks !
Yuzie

Hi @YancongZhang,
We now have nearly 1000 shotgun samples to run through your pipeline.
However, due to limited computational resources, we can only run them in batches (150 samples per run).
We want to merge the results from each batch; do you know how to do this? I have read that the binner MSPminer you are using might be advantageous in this situation, but we don't know how to proceed.

Thanks for all of your help,
Yuzie

Hi Yuzie,

It looks like the Pfam database was not read correctly in your environment. Testing the default MetaWIBELE Docker image (v0.4.4), it works well on my end when using the default Pfam DB that is bundled with MetaWIBELE. Did you build your own Pfam DB by customizing the "domain_db" option in MetaWIBELE's global configuration file (metawibele.cfg)? If so, please check whether your customized DB was generated correctly. You may want to review the online manual on setting up MetaWIBELE's dependent databases, e.g. which are required to be customized (e.g. the UniRef DB), which are optional (e.g. the Pfam DB), and how to customize them. Additionally, is it possible that you made some local configuration changes in your Docker container environment? You could check whether there are differences between your container and the default one, whether any changes influenced file reading, and whether the Pfam DB packaged with MetaWIBELE is accessible.
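
As a quick check, you could also scan the Pfam data file that MetaWIBELE actually reads for non-ASCII bytes (the path below is a placeholder; point it at the file your domain_db / default configuration resolves to):

$ zcat /path/to/pfam_database.gz | grep -nP '[^\x00-\x7F]' | head   # print lines containing non-ASCII bytes, if any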

Best,
Yancong

Hi Yuzie,

The main modules of MetaWIBELE (MetaWIBELE-characterize and MetaWIBELE-prioritize) take the abundance table of the gene catalogs as input (a gene-by-sample matrix). In theory, this abundance table should include all samples from the same dataset. To prepare this type of input, you may want to use the utility workflow in the MetaWIBELE package for preprocessing metagenomic reads. Given the resource limit in your case, one potential solution could be:

  1. First, run the preprocessing workflow in batches for assembly and gene calling, without building gene catalogs (via --bypass-gene-catalog);
  2. After all batches have finished successfully, merge the corresponding files in the "finalized" folder from each batch, and organize the merged files into a single "finalized" folder covering all samples;
  3. Make sure the files of your assemblies and gene calls for all samples are well organized and can be recognized by the workflow (you can pretend that you ran all samples at once and mimic the same organization of folders and files). Then rerun the preprocessing workflow with "--bypass-assembly --bypass-gene-calling" to build the gene catalogs across all samples.

In this way, you will get an abundance table that includes all samples and can be passed to MetaWIBELE for characterization and prioritization.
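
A rough sketch of the two preprocessing passes (the subcommand name and the input/output options below are placeholders to be checked against the MetaWIBELE manual; only the --bypass-* flags are quoted from the steps above):

# pass 1, per batch: stop before building gene catalogs
$ metawibele preprocess --input batch_1_reads/ --output batch_1_out/ --bypass-gene-catalog
# ... repeat for each batch, then merge the per-batch "finalized" folders ...
# pass 2, all samples: skip assembly and gene calling, build the gene catalogs
$ metawibele preprocess --input all_samples/ --output all_samples_out/ --bypass-assembly --bypass-gene-calling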

Hope this helps.

Best,
Yancong

Hello Yancong,

Thanks for your suggestions, and we will try to merge our samples with your helpful tool!

The Pfam DB we used was the one from your Docker environment, and we didn't make any changes to domain_db.


Hi there,
I have used your tool with your testing data, and it went smoothly (preprocess, characterize, and prioritize).
However, when I used our own data, the characterize step produced the error in characterize/global_homology_annotation/uniref90_protein.log.
I have also checked the files below, and I didn't find anything strange:
/usr/local/bin/metawibele_uniref_protein
/usr/local/lib/python3.6/dist-packages/metawibele/characterize/uniref_protein.py
/usr/local/lib/python3.6/dist-packages/metawibele/common/utils.py
The error is as follows (the same UnicodeDecodeError as before).


So I thought there might be something wrong with my input data as prepared by your preprocess step. The attached image shows that the sequence counts in batch_22_genecatalogs.centroid.faa and batch_22_genecatalogs_counts.all.tsv are not the same. Could that be the reason my characterize step kept crashing? (I noticed that the counts in your two test files are the same.)
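
For reference, a quick way to compare the two counts with standard shell tools (file names taken from above; treating the first line of the count table as a header is an assumption):

$ grep -c '^>' batch_22_genecatalogs.centroid.faa            # number of sequences in the gene catalog FASTA
$ tail -n +2 batch_22_genecatalogs_counts.all.tsv | wc -l    # number of data rows in the count table (header excluded)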

Thanks in advance,
Yuzie