Workflow Stalling on SGE

Hello!
I’m attempting to use the wmgx_wmtx workflow but have run into an issue with grid computing. I’m working on an SGE cluster and planned to use 64 cores. From the log files, it looks like the first 64 jobs dispatched just fine! The correct Kneaddata output files have been generated, the .rc files in the sge_files directory all show exit statuses of 0, and running qacct also shows that each of the 64 jobs completed with exit_status 0. Unfortunately, it looks like the next batch of jobs never got sent out. From the anadama.log file, it looks like get_queue_status was only called once, roughly when the last of the 64 jobs were finishing up. It has been about 16 hours since then.

I have had no errors, and the parent process is still running. I would appreciate help figuring out why the second batch of jobs never got submitted.

Thanks,
Mark

Hello Mark, Thank you for the detailed post. Do you see any stdout from the parent process? If not, it likely ran into an error, which should have been printed to stderr at some point. If you are capturing all stdout/stderr, it would be great to know what the error is. It is likely that the jobs were recorded as completed in the database, so if you stop and restart the workflow it should pick up by submitting the next set of jobs. If you do restart, please capture stdout/stderr for that run, as it will help us debug the error if it occurs again with the next queue status call.

Thank you,
Lauren

Hey Lauren, thanks for getting back to me. My stdout and stderr files are empty! I checked my script and re-ran it over the weekend to be sure. If it helps, my log file ends with records like this:
2021-07-24 01:18:58,332 root submit_grid_job INFO: Submitted job for task id 193: grid id 3461576
2021-07-24 01:18:58,354 LoggerReporter log_event INFO: task 193, kneaddata____DNA_UMB06_07- : grid job id 3461576 has status Submitted

Thanks for your help. Let me know if you would like to look at anything else.

Mark

Hi Mark, Thank you for trying again and for the additional information. I think it might be an error from the portion of the SGE code that looks for the username to determine the queue status. Would you try running the following?

  1. Command that should error on your system (if I have the right section of the code that is failing)
    python -c "import pwd, os; print(pwd.getpwuid(os.getuid()))[0]"

  2. Command that should work okay to get the username
    python -c "import getpass; print(getpass.getuser())"

If the first fails and the second works, then I can make the change in the code and tag a new release. In that case, this was just a portion of the code that we missed updating in our python2 to python3 conversion (it works in python2 but fails in python3). Sorry we missed that line.

Thank you,
Lauren

Hi Lauren, you’re right; the first command fails and the second command runs. Additionally, the first command works if you index the output of getpwuid instead of the print. Happy to help find bugs!
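For anyone hitting the same thing, here is a minimal sketch of the difference, assuming Python 3 on a Unix-like system. The variable names are only illustrative and this is not the actual workflow code. In python2, `print(x)[0]` is parsed as a print statement applied to `(x)[0]`; in python3, `print()` is a function that returns None, so indexing its result raises a TypeError.

```python
import getpass
import os
import pwd

uid = os.getuid()

# Python 3: print() returns None, so indexing its return value raises TypeError.
failure = None
try:
    print(pwd.getpwuid(uid))[0]
except TypeError as err:
    failure = err

# Indexing the record returned by getpwuid (field 0 is the login name)
# behaves the same way in python2 and python3.
username_from_pwd = pwd.getpwuid(uid)[0]

# getpass.getuser() checks environment variables such as LOGNAME and USER
# first, then falls back to the password database.
username_from_getpass = getpass.getuser()

print(type(failure).__name__)
print(username_from_pwd, username_from_getpass)
```

The two usernames normally agree, but they can differ if LOGNAME or USER is set to something else in the environment.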

Thanks,
Mark

Hi Lauren, I looked at my package versions and basically all components of the workflow were out of date. Anaconda and pip do not give the most recent versions of biobakery tools by default. I used pip install git+[url]@master to get current versions and now everything seems to be working correctly.
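In case it is useful, this is roughly how installed versions can be checked from Python. It is only a sketch: the distribution names in `PACKAGES` are my assumption about what is installed in a given environment, and `installed_versions` is a helper written for this post, not part of any biobakery tool.

```python
from importlib import metadata

# Assumed distribution names; adjust to match what is installed locally.
PACKAGES = ["anadama2", "biobakery_workflows", "kneaddata", "humann"]


def installed_versions(names):
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    for name, version in installed_versions(PACKAGES).items():
        print(f"{name}: {version or 'not installed'}")
```

Comparing this output against the latest tagged releases shows which components are behind.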

Mark

Hi Mark, That is great. Thank you for the follow-up. I am glad to hear you are set. Please post if you run into any other issues.

Thank you,
Lauren