The CloudBolt Job Engine

Introduction

CloudBolt includes many features that need to be executed in the background, separate from web requests made to CloudBolt’s UI or API. Those processes include orders, actions, rules, and various recurring jobs. The progress and results of those processes are tracked as Jobs, and the Job Engine is how CloudBolt executes them.

The Job Engine is designed to be resilient against most types of errors, including exceptions raised by plug-ins, external libraries, and database queries. However, it is occasionally necessary to troubleshoot how Jobs are being processed. This document helps you determine the state of the Job Engine and gather the information you might need when opening a support ticket.

What is the normal progression of job statuses?

  1. PENDING
  2. QUEUED
  3. RUNNING
  4. If canceled:
    • TO_CANCEL
    • CANCELED
  5. If the job completes (for actions, the status is determined by what the action returns):
    • SUCCESS -or- WARNING -or- FAILURE
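
The progression above can be written down as a small transition table. This is only an illustrative sketch of the list, not CloudBolt's internal code; the names ALLOWED_TRANSITIONS, is_expected_transition, and is_terminal are hypothetical.

```python
# Illustrative model of the status progression listed above.
# NOT CloudBolt's internal code; all names here are hypothetical.
ALLOWED_TRANSITIONS = {
    "PENDING": {"QUEUED"},
    "QUEUED": {"RUNNING"},
    "RUNNING": {"TO_CANCEL", "SUCCESS", "WARNING", "FAILURE"},
    "TO_CANCEL": {"CANCELED"},
    # Terminal statuses have no outgoing transitions.
    "CANCELED": set(),
    "SUCCESS": set(),
    "WARNING": set(),
    "FAILURE": set(),
}


def is_expected_transition(current, new):
    """Return True if moving from `current` to `new` matches the list above."""
    return new in ALLOWED_TRANSITIONS.get(current, set())


def is_terminal(status):
    """Return True if the status is known and no further change is expected."""
    return status in ALLOWED_TRANSITIONS and not ALLOWED_TRANSITIONS[status]
```

A job observed moving between two statuses for which is_expected_transition returns False (for example, straight from PENDING to RUNNING) would be worth noting when you troubleshoot.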

How does each job get processed by the job engine?

  1. By default, jobs are created with a PENDING status.
  2. If a job is scheduled for the future, the workers will ignore the job until its scheduled time has passed.
  3. For Recurring Jobs, the workers check their schedules every minute and will generate a PENDING job if appropriate.
  4. Job engine workers check for PENDING jobs once a second and set their statuses to QUEUED.
  5. The worker immediately spawns a new thread for each QUEUED job and changes the job’s status to RUNNING.
  6. If the ‘Cancel Job’ button is clicked in the UI while the job is running, the job is set to TO_CANCEL and an exception
    is raised inside the job’s thread. A message is added to the job’s log, the job’s status is set to CANCELED, and the job stops running.
  7. Otherwise, the job’s run() method completes and returns results. Depending on the results, the job status gets set
    to SUCCESS, WARNING, or FAILURE.
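
The steps above can be sketched as a toy, single-threaded simulation. It is a simplification under stated assumptions: ToyJob, JobCanceled, and worker_tick are hypothetical names, jobs run one after another instead of in separate threads, and cancellation is simulated by the job raising an exception itself. It does not reflect how jobengine.py is actually implemented.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import Callable, List, Optional


class JobCanceled(Exception):
    """Hypothetical exception standing in for a click on 'Cancel Job'."""


@dataclass
class ToyJob:
    """Hypothetical stand-in for a job record; not CloudBolt's Job model."""
    run: Callable[[], str]                    # returns SUCCESS, WARNING, or FAILURE
    scheduled_for: Optional[datetime] = None  # None means "run as soon as possible"
    status: str = "PENDING"                   # step 1: jobs start out PENDING
    log: List[str] = field(default_factory=list)


def worker_tick(jobs, now):
    """One pass of a toy worker over the job list."""
    # Steps 2 and 4: queue PENDING jobs whose scheduled time has passed.
    for job in jobs:
        if job.status != "PENDING":
            continue
        if job.scheduled_for is not None and job.scheduled_for > now:
            continue  # scheduled for the future: ignore for now
        job.status = "QUEUED"

    # Step 5: start each QUEUED job (sequentially here, not in threads).
    for job in jobs:
        if job.status != "QUEUED":
            continue
        job.status = "RUNNING"
        try:
            # Step 7: the result of run() becomes the final status.
            job.status = job.run()
        except JobCanceled:
            # Step 6: in the described flow the status is TO_CANCEL by the
            # time the exception is raised; the toy passes through it here.
            job.status = "TO_CANCEL"
            job.log.append("Job was canceled.")
            job.status = "CANCELED"
```

Calling worker_tick repeatedly with an advancing `now` mimics the once-a-second check: a job scheduled for the future stays PENDING until a tick whose `now` is at or past its scheduled time.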

What can go wrong with my job engine?

  • There could be problems in the underlying database.
  • Recurring or scheduled jobs might never get queued because a worker process is not running each minute, e.g. if
    you have disabled or modified the cron job on your CloudBolt server.
  • Running workers could die before completing a job, leaving its status as QUEUED or RUNNING.
  • A job can run indefinitely, in which case the job engine will never exit.
    • This can be caused by infinite loops or deadlocks within your scripts or jobs.
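
One way to spot the last two failure modes is to look for jobs that have sat in QUEUED or RUNNING for longer than you would expect. The sketch below assumes you have already collected (job number, status, last update time) tuples by some means, for example by hand from the UI; the function name find_stale_jobs and the two-hour default are arbitrary assumptions, and a long-running job is not necessarily a stuck one.

```python
from datetime import datetime, timedelta


def find_stale_jobs(jobs, now, max_age=timedelta(hours=2)):
    """Return job numbers that look stuck.

    `jobs` is an iterable of (job_number, status, last_updated) tuples.
    The tuple layout and the default threshold are assumptions made for
    this sketch, not something CloudBolt provides in this form.
    """
    stale = []
    for job_number, status, last_updated in jobs:
        # Only in-flight statuses can be "stuck"; finished jobs are ignored.
        if status in ("QUEUED", "RUNNING") and now - last_updated > max_age:
            stale.append(job_number)
    return stale
```

Any job numbers this returns are candidates to investigate, not confirmed failures.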

How do I troubleshoot my job engine?

Answer the following questions to understand the scope of your issues:

  • Which job or order caused you to notice the problem? Note its job number, status, and URL.
  • Which related jobs are having problems? Note their job numbers, statuses, and URLs.
  • Which job(s) started running most recently? Note their job numbers, statuses, and URLs.
  • Have other jobs started running after the jobs you noted above?
  • Are some jobs in a different status from others? Note this and review the processing sequence described above to
    understand which job was the last one to run.
  • Are your jobs in a PENDING state? This indicates that the jobs have not been added to the queue, which might be because
    they are scheduled to run in the future.
  • Are your jobs in a QUEUED state? This indicates the job engine has identified that it should run the jobs but has not
    started running them yet. Check whether a job engine worker is running via ps aux | grep jobengine.
  • Are your jobs in a RUNNING state? This indicates the job engine has started running the jobs, and either it is still
    running them correctly or it has crashed. Check the job engine’s service status, and if it is running as expected, review each job’s log to see what its last step was and whether there are any error messages.
  • Are your jobs in a CANCELED state? This indicates the job engine has canceled those jobs, and it should be available to
    run future jobs. Try running a new job. If it runs correctly, there is no problem: your job engine is behaving as designed.
  • Are your jobs in a FAILURE, WARNING, or SUCCESS state? Your jobs completed running and returned this status, so the job
    engine is working properly. Troubleshoot the underlying job for any failures you are seeing.
  • If the above steps do not help, create a support ticket. Describe which of the above steps you took and list the
    specific job numbers, with the status you expected and the actual status for each. Attach your application.log, jobengine.log, and each job’s individual log file, which you can download from the job’s detail page. Provide screenshots of how the jobs appear in the UI, as well as the output from running ps aux | grep jobengine.
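
The status-specific questions above can be condensed into a small lookup. This is only a convenience sketch of the checklist; NEXT_STEP, FALLBACK, and next_step are hypothetical names, and any status not covered by the checklist (such as TO_CANCEL) falls through to the support-ticket advice.

```python
# Condensed form of the checklist above; hypothetical helper, not a CloudBolt API.
NEXT_STEP = {
    "PENDING": "Not yet added to the queue; check whether the job is "
               "scheduled to run in the future.",
    "QUEUED": "Identified but not started; check for a running worker "
              "with: ps aux | grep jobengine",
    "RUNNING": "Started; check the job engine's service status, then review "
               "the job's log for its last step and any error messages.",
    "CANCELED": "Canceled by the job engine; try running a new job to "
                "confirm the engine is available.",
}

# Completed jobs mean the engine did its work; look at the job itself.
COMPLETED = ("Job finished and returned this status; troubleshoot the "
             "underlying job rather than the job engine.")
for _status in ("SUCCESS", "WARNING", "FAILURE"):
    NEXT_STEP[_status] = COMPLETED

FALLBACK = "Not covered by the checklist; gather logs and create a support ticket."


def next_step(status):
    """Return the suggested next troubleshooting step for a job status."""
    return NEXT_STEP.get(status.strip().upper(), FALLBACK)
```

For example, next_step("queued") returns the advice to check for a running worker process.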

How do I manually restart the job engine?

First, be aware that any currently running jobs will be killed and left stuck in the RUNNING state.

To manually restart the job engine, you can kill the jobengine.py processes and let cron restart the job engine workers:

pkill -f jobengine

Cron should then restart a jobengine.py process within a minute.
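
To confirm that a worker has come back, you can run ps aux | grep jobengine again after a minute. Note that the grep command itself usually appears in that output, so one matching line does not prove a worker is running. The sketch below filters captured ps aux text for job engine processes while skipping the grep line; the function name jobengine_processes is hypothetical, and the sample output (including the path /path/to/jobengine.py) is a made-up placeholder, not real CloudBolt output.

```python
def jobengine_processes(ps_output):
    """Return the COMMAND column of `ps aux` lines that mention jobengine.

    Lines whose command is the grep invocation itself are skipped.
    """
    commands = []
    for line in ps_output.splitlines():
        # `ps aux` prints 10 columns before COMMAND; keep COMMAND intact.
        fields = line.split(None, 10)
        if len(fields) < 11:
            continue
        command = fields[10]
        if "jobengine" in command and not command.startswith("grep"):
            commands.append(command)
    return commands


# Made-up sample text for illustration only.
SAMPLE = """\
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root      1234  0.5  1.2 123456  7890 ?        Sl   10:00   0:01 python /path/to/jobengine.py
root      5678  0.0  0.0   9876   900 pts/0    S+   10:01   0:00 grep --color=auto jobengine
"""

print(jobengine_processes(SAMPLE))
```

To use live data instead of the sample, the text could be captured with Python's subprocess.run(["ps", "aux"], capture_output=True, text=True).stdout and passed to the same function. An empty list more than a minute after the restart suggests cron has not started a new worker.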