Certain failure situations can prevent jobs from being rescheduled for subsequent executions. I saw this happen while testing the compression job. The job would run and cause a spike in GC activity in Cassandra, which resulted in some of the job scheduler's queries failing. I then discovered that the job was no longer running. There are some problems with the post-job execution steps.
In SchedulerImpl.executeJob(), setJobFinished() is called after the job executes. This method simply updates the finished_jobs_idx table. If that query fails, then the job will effectively be abandoned. SchedulerImpl.start() queries for jobs to be executed, which can include jobs that didn't previously complete for one reason or another. The queries that determine the eligible jobs for a particular time slice exclude jobs that are in finished_jobs_idx. This is one problem.
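The hazard can be sketched with a minimal, self-contained example. The class and method names below (executeJob, setJobFinished, the in-memory stand-in for finished_jobs_idx) mirror the report but are hypothetical simplifications, not the actual SchedulerImpl code:

```java
import java.util.HashSet;
import java.util.Set;

// Sketch of the first hazard: the job body runs, but the write to
// finished_jobs_idx fails, so the steps that follow setJobFinished()
// (including rescheduling) never run and the job is abandoned.
public class AbandonedJobSketch {
    // Stand-in for the finished_jobs_idx table.
    static final Set<String> finishedJobsIdx = new HashSet<>();

    static void setJobFinished(String jobId, boolean queryFails) {
        if (queryFails) {
            // Simulates the Cassandra query failing, e.g. under GC pressure.
            throw new RuntimeException("finished_jobs_idx update failed");
        }
        finishedJobsIdx.add(jobId);
    }

    static void executeJob(String jobId, boolean queryFails) {
        // 1. run the job body (elided)
        // 2. record completion; if this throws, step 3 never happens
        setJobFinished(jobId, queryFails);
        // 3. reschedule the next execution (elided)
    }

    public static void main(String[] args) {
        try {
            executeJob("compression-job", true);
        } catch (RuntimeException ignored) {
            // The scheduler moves on; nothing retries this job.
        }
        // The job never reached finished_jobs_idx, and it was never
        // rescheduled either, so no future time slice will run it.
        System.out.println(finishedJobsIdx.contains("compression-job"));
    }
}
```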
We get into another, similar situation if we fail in the reschedule() method. Assume for the moment that, when looking for eligible jobs, we do not ignore jobs in finished_jobs_idx. SchedulerImpl also maintains an in-memory cache of active job ids for jobs running locally. When looking for eligible jobs to execute, we exclude those that are in the active jobs cache. The job id is removed from the cache in the deactivate() method, which will be skipped if reschedule() fails.
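This second hazard can be sketched the same way. Again the names (reschedule, deactivate, the activeJobs set standing in for the local cache) are hypothetical simplifications of what the report describes:

```java
import java.util.HashSet;
import java.util.Set;

// Sketch of the second hazard: if reschedule() throws, deactivate()
// is skipped, so the job id stays in the in-memory active-jobs cache
// and eligibility checks on this node keep excluding the job.
public class StuckActiveJobSketch {
    // Stand-in for the in-memory cache of locally running job ids.
    static final Set<String> activeJobs = new HashSet<>();

    static void reschedule(String jobId, boolean queryFails) {
        if (queryFails) {
            throw new RuntimeException("reschedule query failed");
        }
    }

    static void deactivate(String jobId) {
        activeJobs.remove(jobId);
    }

    static void finishJob(String jobId, boolean rescheduleFails) {
        reschedule(jobId, rescheduleFails); // throws here...
        deactivate(jobId);                  // ...so this never runs
    }

    static boolean isEligible(String jobId) {
        // Jobs still in the active cache are excluded from execution.
        return !activeJobs.contains(jobId);
    }

    public static void main(String[] args) {
        activeJobs.add("compression-job");
        try {
            finishJob("compression-job", true);
        } catch (RuntimeException ignored) { }
        // The stale cache entry makes the job permanently ineligible here.
        System.out.println(isEligible("compression-job")); // prints "false"
    }
}
```

Note that this failure mode exists independently of finished_jobs_idx: even if the eligibility queries were fixed, the stale cache entry alone would keep the job from running on this node until a restart clears the in-memory state.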