There are at least a few use cases for scheduled jobs, including:
- Computing rollups
- Purging data (once we stop using TTL; see HWKMETRICS-191 for details)
- Deleting tenants or metrics
Quartz is a popular job execution framework which we have used in the past in RHQ. We do not want to use Quartz though, as it requires an RDBMS. Ideally, the solution will introduce no new external dependencies, relying only on Cassandra.
The solution needs to provide durability. For example, if I submit a request to execute a job in ten minutes and if the server is shut down in two minutes, I should not have to resubmit the job when the server comes back up. The solution also needs to be able to store any parameters passed to a job as well as any intermediate results that a job might store.
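To make the durability requirement concrete, here is a minimal Java sketch. All names (`JobDetails`, `JobStore`, `InMemoryJobStore`) are hypothetical and not part of any existing API, and the in-memory map merely stands in for a Cassandra table. The point it illustrates is that the trigger time, the parameters, and any intermediate state live in the store rather than in server memory, so a restarted server only has to query the store to find its work.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.UUID;

// Hypothetical persisted form of a scheduled job. In a real store,
// each field would map to a column.
final class JobDetails {
    final UUID id;
    final String type;
    final long triggerTimeMillis;
    final Map<String, String> parameters;
    // Intermediate results the job chooses to save while it runs.
    final Map<String, String> state = new HashMap<>();

    JobDetails(UUID id, String type, long triggerTimeMillis, Map<String, String> parameters) {
        this.id = id;
        this.type = type;
        this.triggerTimeMillis = triggerTimeMillis;
        this.parameters = new HashMap<>(parameters);
    }
}

// Hypothetical storage contract; an actual implementation would be backed by Cassandra.
interface JobStore {
    void save(JobDetails job);

    // Returns every stored job whose trigger time is at or before nowMillis.
    List<JobDetails> findDue(long nowMillis);
}

// In-memory stand-in used only to keep the sketch self-contained.
final class InMemoryJobStore implements JobStore {
    private final Map<UUID, JobDetails> jobs = new HashMap<>();

    @Override
    public void save(JobDetails job) {
        jobs.put(job.id, job);
    }

    @Override
    public List<JobDetails> findDue(long nowMillis) {
        List<JobDetails> due = new ArrayList<>();
        for (JobDetails job : jobs.values()) {
            if (job.triggerTimeMillis <= nowMillis) {
                due.add(job);
            }
        }
        return due;
    }
}
```

In this sketch, a job saved with a trigger time ten minutes out is not returned by `findDue` two minutes later, but it is returned once the trigger time has passed, regardless of which server process performs the query.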
The solution needs to be scalable. Suppose (for simplicity) that the server executes one job at a time and that we have 10 recurring jobs to execute every hour. One of the jobs is very slow, causing execution of the remaining jobs to fall far behind schedule. We should be able to simply start up additional servers that will automatically start executing the other jobs.
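One way to let additional servers pick up waiting jobs is for each server to claim a job atomically before running it. The sketch below is an assumption, not a design this document has settled on: `JobClaims`, `ClaimingServer`, `tryClaim`, and `claimNext` are hypothetical names, and `ConcurrentHashMap.putIfAbsent` stands in for an atomic conditional write in the shared store, such as a Cassandra lightweight transaction (`INSERT ... IF NOT EXISTS`).

```java
import java.util.List;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Hypothetical registry of which server owns which job.
// The map stands in for a shared table with conditional writes.
final class JobClaims {
    private final ConcurrentMap<String, String> owners = new ConcurrentHashMap<>();

    // Returns true only for the first server that claims the job.
    boolean tryClaim(String jobId, String serverId) {
        return owners.putIfAbsent(jobId, serverId) == null;
    }

    // Returns the owning server id, or null if the job is unclaimed.
    String ownerOf(String jobId) {
        return owners.get(jobId);
    }
}

// Hypothetical server that executes one job at a time.
final class ClaimingServer {
    private final String serverId;
    private final JobClaims claims;

    ClaimingServer(String serverId, JobClaims claims) {
        this.serverId = serverId;
        this.claims = claims;
    }

    // Claims the first due job that no other server owns.
    // Returns the claimed job id, or null if every due job is already taken.
    String claimNext(List<String> dueJobIds) {
        for (String jobId : dueJobIds) {
            if (claims.tryClaim(jobId, serverId)) {
                return jobId;
            }
        }
        return null;
    }
}
```

With this arrangement, if server A is stuck on a slow job, a newly started server B skips that job because the claim fails and moves on to the next unclaimed one. No coordination between the servers is needed beyond the shared store.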
The solution needs to be fault tolerant. Suppose we have servers A, B, and C and a job running on server A, which goes down. Either B or C should automatically resume execution of the job. If we are running only a single server, A, and it goes down, it should be capable of resuming execution of the job once it comes back up. Because of the durability requirements, there should be a way to resume a job from the point it left off without having to redo work that has already been done.
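Resuming from the point a job left off implies that the job records its progress durably as it goes. Here is a minimal sketch, assuming a hypothetical `CheckpointStore` (an in-memory map here, standing in for Cassandra) and a hypothetical `ResumableJob` that processes numbered units of work, called buckets, in order. The `maxBuckets` argument exists only to simulate a server going down partway through.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical durable store of per-job progress.
final class CheckpointStore {
    private final Map<String, Integer> lastCompleted = new HashMap<>();

    // Index of the next bucket to process; 0 if the job has never run.
    int nextIndex(String jobId) {
        return lastCompleted.getOrDefault(jobId, -1) + 1;
    }

    void markCompleted(String jobId, int index) {
        lastCompleted.put(jobId, index);
    }
}

// Hypothetical job that works through buckets 0..totalBuckets-1.
final class ResumableJob {
    private final String jobId;
    private final int totalBuckets;
    private final CheckpointStore checkpoints;
    // Buckets handled by this particular run, kept only for illustration.
    final List<Integer> processed = new ArrayList<>();

    ResumableJob(String jobId, int totalBuckets, CheckpointStore checkpoints) {
        this.jobId = jobId;
        this.totalBuckets = totalBuckets;
        this.checkpoints = checkpoints;
    }

    // Starts at the stored checkpoint and processes at most maxBuckets buckets.
    void run(int maxBuckets) {
        int start = checkpoints.nextIndex(jobId);
        int end = Math.min(totalBuckets, start + maxBuckets);
        for (int i = start; i < end; i++) {
            processed.add(i);                    // placeholder for the real work
            checkpoints.markCompleted(jobId, i); // persist progress after each bucket
        }
    }
}
```

If one server runs the job for two of five buckets and then goes down, a second `ResumableJob` created with the same job id and the same store (on any server) continues with the third bucket. Because the checkpoint is written after the work, a crash between the two steps would repeat one bucket, so under this scheme each unit of work should be idempotent.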
Let me talk a little bit about what the solution does not need to be. It does not need to be a high-performance, distributed compute engine. If there is a need for that, we should look to something like Spark or possibly Infinispan. The solution could, however, be complementary to something like Spark. For example, we might schedule a recurring job that kicks off some computationally intensive work on a Spark cluster.
The solution does not need to provide very low-latency job execution. I do not think it is necessary to support jobs that need to execute every millisecond or even every second. The goal here is asynchronous, batch execution, not real-time job processing. Furthermore, I do not think the solution needs to provide really strong guarantees around execution times. We do not need to guarantee, for example, that a job will start within 10 milliseconds of its scheduled time. With that said, we should be able to offer some reasonable, minimum guarantees around execution times.