The Job Engine¶
Failures in the CloudBolt Job Engine are rare, and it is designed to be resilient against most types of provider-specific failures. However, user-written actions can cause problems in the job engine that you may need to troubleshoot.
What is the normal progression of job statuses?¶
- Normally: PENDING, then QUEUED, then RUNNING
- If canceled: CANCELED
- If the job completes: SUCCESS, WARNING, or FAILURE (for actions, the final status is determined by what the action returns)
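The progression above can be sketched as a simple state machine. This is an illustrative model built only from the status names in this document, not actual CloudBolt code:

```python
# Illustrative model of the job status progression described above.
# Not CloudBolt source code; status names are taken from this document.
ALLOWED_TRANSITIONS = {
    "PENDING": {"QUEUED"},
    "QUEUED": {"RUNNING"},
    # A running job is either canceled or finishes with a result status.
    "RUNNING": {"CANCELED", "SUCCESS", "WARNING", "FAILURE"},
    # Terminal statuses have no outgoing transitions.
    "CANCELED": set(),
    "SUCCESS": set(),
    "WARNING": set(),
    "FAILURE": set(),
}


def can_transition(current: str, new: str) -> bool:
    """Return True if a job may move from `current` to `new`."""
    return new in ALLOWED_TRANSITIONS.get(current, set())
```

Knowing which transitions are legal helps when reading job histories: a job that jumps straight from QUEUED to SUCCESS, for instance, would indicate something outside the normal flow.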
How does each job get processed by the job engine?¶
- By default, jobs are created with a PENDING status. Running a job via the UI immediately adds it to the job queue (RabbitMQ) and sets its status to QUEUED.
- For Recurring Jobs or jobs scheduled for the future, a root cron job runs once a minute and adds any PENDING jobs to the job queue (RabbitMQ) and sets their status to QUEUED.
- A job engine worker (Celery) sees the QUEUED job in the queue and claims it.
- The worker spawns a new thread and changes the job’s status to RUNNING.
- If the ‘Cancel Job’ button is clicked in the UI while the job is running, the job is set to CANCELED, an Exception is raised inside the job’s thread, and the job stops running.
- Otherwise, the job’s run() method completes and returns results. Depending on the results, the job status gets set to SUCCESS, WARNING, or FAILURE.
- If a job engine worker dies while processing tasks, the jobs will be either canceled or re-queued, depending on how far the job got before the worker died. This is done automatically via a cron job, but that command can be run manually via /opt/cloudbolt/manage.py revise_job_statuses.
- Note: RUNNING jobs that have begun to execute will be marked as CANCELED, while RUNNING jobs that have not yet executed will be re-queued. QUEUED jobs will get re-queued, but only when the worker queue is empty, or the job has been QUEUED for over an hour. That timeout can be adjusted with the --timeout flag for the management command. For example, to requeue any queued jobs older than 5 minutes, you could run /opt/cloudbolt/manage.py revise_job_statuses --timeout 5.
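The recovery rules above can be condensed into a small decision function. This is a hypothetical sketch of the logic this document describes, not the actual source of `revise_job_statuses`:

```python
# Hypothetical sketch of the recovery rules described above; not the actual
# revise_job_statuses implementation.
def revise_status(status: str, has_begun_executing: bool,
                  queued_minutes: float, queue_is_empty: bool,
                  timeout_minutes: float = 60) -> str:
    """Decide what to do with a job left over after a worker died."""
    if status == "RUNNING":
        # Jobs that already started doing work are canceled;
        # jobs claimed but not yet executed are safe to requeue.
        return "cancel" if has_begun_executing else "requeue"
    if status == "QUEUED":
        # Requeue only when the queue is drained or the job has
        # waited longer than the (adjustable) timeout.
        if queue_is_empty or queued_minutes > timeout_minutes:
            return "requeue"
        return "leave"
    return "leave"
```

The `timeout_minutes` default mirrors the one-hour default noted above, and lowering it corresponds to passing a smaller value via the management command's --timeout flag.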
What can go wrong with my job engine?¶
- There could be problems in the underlying database.
- Recurring or scheduled jobs might never get queued because the cron job is not running each minute, e.g. if you have disabled or modified the cron job on your CloudBolt server.
- The RabbitMQ queue might stop running.
- The Celery workers might stop running.
- Running workers could die before completing a job, leaving the job’s status as QUEUED or RUNNING.
- Normally, a cron job will re-queue or cancel these jobs, but it could have been disabled. You can run the command manually via /opt/cloudbolt/manage.py revise_job_statuses.
- A job can run forever, in which case the job engine worker never finishes it.
- This can be caused by infinite loops or deadlocks in your scripts or jobs.
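One defensive pattern for user-written actions is to bound every polling loop with an explicit deadline, so a stuck external call fails cleanly instead of hanging the worker forever. This is an illustrative sketch, not a CloudBolt API; `check_ready` stands in for whatever condition your action polls:

```python
import time


def wait_until(check_ready, timeout_seconds: float, poll_interval: float = 5.0):
    """Poll `check_ready()` until it returns True or the deadline passes.

    Illustrative deadline guard for a polling loop in a user-written action;
    `check_ready` is a hypothetical callable, not part of CloudBolt.
    """
    deadline = time.monotonic() + timeout_seconds
    while time.monotonic() < deadline:
        if check_ready():
            return True
        # Never sleep past the deadline.
        time.sleep(min(poll_interval, max(0.0, deadline - time.monotonic())))
    # Raising (or returning a FAILURE status from the action) beats
    # looping forever and wedging the job engine worker.
    raise TimeoutError(f"condition not met within {timeout_seconds} seconds")
```

An action built this way ends in a FAILURE the job engine can report, rather than a RUNNING job that never completes.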
How do I troubleshoot my job engine?¶
Answer the following questions to understand the scope of your issues:
- Which job or order caused you to notice the problem? Note its job number, status, and URL.
- Which related jobs are having problems? Note their job numbers, statuses, and URLs.
- Which job(s) started running most recently? Note their job numbers, statuses, and URLs.
- Have other jobs started running after the jobs you noted above?
- Are some jobs in a different status from others? Note this and review the job-status sequence described above to determine which job ran last.
- Are your jobs in a PENDING state? This indicates that the jobs have not been added to the queue, which might be because the job is scheduled to run in the future.
- Are your jobs in a QUEUED state? This indicates the job engine has identified it should run a job but has not started running it yet. Check whether the job engine is operating properly by running service rabbitmq-server status and supervisorctl status celeryd:*.
- Are your jobs in a RUNNING state? This indicates the job engine has started running the job, and is either still running the job correctly or the job engine has crashed. Check the job engine’s service status, and if it is running as expected, review the job’s log to see what its last step was and whether there are any error messages.
- Are your jobs in a CANCELED state? This indicates the job engine has canceled those jobs, and should be available to run future jobs. Try running a new job—if it runs correctly, there is no problem: your job engine is behaving as designed.
- Are your jobs in a FAILURE, WARNING, or SUCCESS state? Your jobs completed running and returned this status, so the job engine is working properly. Troubleshoot the underlying job for any failures you are seeing.
- If the above steps do not help, create a support ticket and describe which of the above steps you took as well as the specific job numbers (with your expected and their actual status for each), and attach your application.log, jobengine.log, and each job’s individual log file, which you can download from the job’s detail page. Provide screenshots of how the jobs appear in the UI, as well as the output from running service rabbitmq-server status and supervisorctl status celeryd:*.
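The per-status checks above can be condensed into a small triage table. This is only a summary of the guidance in this section, phrased as an illustrative lookup:

```python
# Illustrative triage table condensing the per-status checks above.
TRIAGE = {
    "PENDING": "Not yet queued; the job may simply be scheduled for the future.",
    "QUEUED": "Not started yet; check rabbitmq-server and the celeryd workers.",
    "RUNNING": "Started; check worker health, then the job's log for its last step.",
    "CANCELED": "Job engine canceled it; try submitting a new job to confirm health.",
    "FAILURE": "Job engine is working; troubleshoot the job itself.",
    "WARNING": "Job engine is working; troubleshoot the job itself.",
    "SUCCESS": "Job engine is working properly.",
}


def triage(status: str) -> str:
    """Return the first diagnostic step for a given job status."""
    return TRIAGE.get(status, "Unknown status; open a support ticket.")
```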
How do I restart the job engine?¶
To restart the job engine workers, first check the status:
supervisorctl status celeryd:*
If it returns an error or says any of the workers are not running, restart the service:
supervisorctl restart celeryd:*
Only restart the job engine’s queue if told to do so and you are confident that it will resolve the issue. To restart the queue, first check the status:
service rabbitmq-server status
If it returns an error or says the service is not running, restart the service:
service rabbitmq-server restart
Any jobs that were QUEUED should be re-queued automatically and will start to run once the job engine is running again. Any jobs that were RUNNING will be marked as CANCELED and will need to be restarted. The easiest method is to duplicate and re-submit the job via the UI; alternatively, manually resetting a job’s status to ‘PENDING’ should cause the job engine to run it again.