While testing the compression job, Cassandra hit an OutOfMemoryError during a couple of test runs, which caused the JVM process to terminate. After restarting Cassandra, the compression job was no longer running. Two tables are involved in searching for jobs to execute: scheduled_jobs_idx and active_time_slices. When the job scheduler wakes up, it first stores the current minute in which it is running in active_time_slices. It then queries active_time_slices to determine which partitions in scheduled_jobs_idx should be queried. A timestamp is removed from active_time_slices only after all jobs for that time have finished. In both cases in which Cassandra crashed, the compression job did not finish, yet the timestamp was not in active_time_slices. There was no indication in the debug logs that the timestamp had been removed from active_time_slices.
I wound up restarting the Hawkular Metrics server due to another, related issue. The job scheduler caches in memory the time slices for which it is currently running jobs. When it wakes up and queries active_time_slices, it filters out the timestamps that are in the cache. The compression job failed and the cache was not updated, so I needed to restart the server. And because active_time_slices did not have the timestamp, the job scheduler would not query the older partition in scheduled_jobs_idx, which resulted in the compression job no longer being executed.
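To make the two failure modes concrete, here is a minimal in-memory sketch of the lookup as described above. It is not the actual scheduler code: the class name, field names, and method are hypothetical, and plain Java sets stand in for the active_time_slices table and the scheduler's cache.

```java
import java.util.HashSet;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical model of the lookup; the real scheduler reads these values from Cassandra.
class TimeSliceLookupSketch {
    // Stand-in for active_time_slices: timestamps (as minutes) that still have unfinished jobs.
    final Set<Long> activeTimeSlices = new TreeSet<>();
    // Stand-in for the in-memory cache of time slices whose jobs are currently running.
    final Set<Long> inProgressCache = new HashSet<>();

    // Returns the scheduled_jobs_idx partitions the scheduler would query on wake-up.
    Set<Long> partitionsToQuery(long currentMinute) {
        // Step 1: record the current minute in active_time_slices.
        activeTimeSlices.add(currentMinute);
        // Step 2: read back every active time slice.
        Set<Long> result = new TreeSet<>(activeTimeSlices);
        // Step 3: filter out time slices that the cache says are already being worked on.
        result.removeAll(inProgressCache);
        return result;
    }
}
```

In this model, if an earlier timestamp is lost from activeTimeSlices, or if it stays in inProgressCache after a failed run, partitionsToQuery never returns it, so the older partition in scheduled_jobs_idx is never read.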
The timestamp was missing from active_time_slices because there was data loss in Cassandra. By default the commit log is synced to disk every 10 seconds (commitlog_sync set to periodic with commitlog_sync_period_in_ms at 10000), which means that there is a 10 second window for data loss when Cassandra is killed abruptly. The sync interval can be made shorter, but at the expense of slowing down writes.
In a multi-node cluster this generally wouldn't be as much of an issue unless all replicas for a write suddenly go down at about the same time. It is possible though.
I think we can avoid this problem simply by querying only scheduled_jobs_idx to find jobs to execute. We could find a job in two rows in different partitions in scheduled_jobs_idx. If that happens, we know we want to execute the job for the earlier time. The earlier job execution in this case has already completed, and the job scheduler won't execute it again. We need to let the job scheduler run for that earlier time, though, so it can finish its post-execution cleanup steps.
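A rough sketch of the "earlier time wins" rule, under stated assumptions: scheduled_jobs_idx is modeled as a sorted in-memory map from time slice to job ids, and the class and method names are hypothetical. How the scheduler would discover which partitions to read without active_time_slices is left open here.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.SortedMap;

// Hypothetical model of the proposed lookup; not the actual scheduler implementation.
class EarliestJobSliceSketch {
    // scheduledJobsIdx maps a time slice to the job ids stored in that partition.
    // Returns, for each job that is due, the earliest time slice in which it appears.
    static Map<String, Long> jobsToRun(SortedMap<Long, Set<String>> scheduledJobsIdx,
                                       long currentMinute) {
        Map<String, Long> result = new HashMap<>();
        // headMap is exclusive of its argument, so add 1 to include currentMinute itself.
        for (Map.Entry<Long, Set<String>> partition
                : scheduledJobsIdx.headMap(currentMinute + 1).entrySet()) {
            for (String jobId : partition.getValue()) {
                // Partitions are visited in ascending time order, so the first
                // time slice recorded for a job is the earliest one.
                result.putIfAbsent(jobId, partition.getKey());
            }
        }
        return result;
    }
}
```

With a job present in both the partition for minute 100 and the one for minute 101, this returns minute 100 for that job, which lets the scheduler finish the post-execution steps for the earlier time first.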