Before executing a job, the scheduler acquires that job's lock. Each lock has a TTL that is currently hard-coded to one hour. We have already seen situations where the compression job takes more than an hour to complete, and we wind up with a bunch of exceptions in the log, as reported in
Eliminating noise in the logs is not the primary reason for renewing locks, though; the primary reason is to help ensure that only a single instance of a job executes at any given time.
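To make the failure mode concrete, here is a minimal in-memory sketch of the TTL-lock semantics described above. This is purely illustrative: `JobLock` and its methods are hypothetical names, and in the real system the lock lives in Cassandra and acquisition is a lightweight transaction (LWT).

```python
import time

class JobLock:
    """In-memory stand-in for a Cassandra-backed job lock with a TTL.

    Models only the semantics: a lock is held by one owner until its
    TTL lapses, after which any server may take it over.
    """

    def __init__(self, ttl_seconds):
        self.ttl = ttl_seconds
        self.owner = None
        self.expires_at = 0.0

    def try_acquire(self, owner, now=None):
        """Acquire the lock if it is free or its TTL has lapsed."""
        now = time.monotonic() if now is None else now
        if self.owner is None or now >= self.expires_at:
            self.owner = owner
            self.expires_at = now + self.ttl
            return True
        return False

# A one-hour TTL: a job that runs longer than that loses its guarantee.
lock = JobLock(ttl_seconds=3600)
assert lock.try_acquire("server-1", now=0)       # lock granted
assert not lock.try_acquire("server-2", now=10)  # still held
assert lock.try_acquire("server-2", now=3700)    # TTL lapsed, lock taken over
```

The last line is exactly the problem case: once the TTL lapses mid-run, a second server can acquire the lock while the first is still executing the job.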
There are a couple of questions to consider. The first is when a lock should be renewed. For repeating jobs, we can look at the trigger to determine the execution frequency; the compression job, for example, is configured to run every two hours. We could renew the lock with a four or six hour TTL to account for runs that go way over. That would minimize the number of renewals, which is good since renewal is an expensive operation involving a LWT. The downside shows up on failover. Suppose the server goes down right after the job starts, and another server is brought up a minute later. The new server will not be able to execute the compression job until the lock expires, and when it does get to run the job, it will run it continuously until it is back on schedule. This will result in a big spike in load on Cassandra.
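The tradeoff in the paragraph above can be spelled out with a little arithmetic. A sketch, using the numbers from the discussion (a two-hour run interval; function names are mine, not part of the scheduler):

```python
# The compression job's configured run interval, in hours.
RUN_INTERVAL_H = 2

def worst_case_failover_hours(ttl_hours):
    """If the owning server dies right after acquiring the lock, a
    replacement server must wait out the full TTL before it can take
    over, so the worst-case failover delay equals the TTL."""
    return ttl_hours

def missed_runs_during_outage(ttl_hours, run_interval_h=RUN_INTERVAL_H):
    """Scheduled executions that pile up while the lock is stuck.
    The new server runs them back-to-back to catch up, which is the
    load spike on Cassandra."""
    return ttl_hours // run_interval_h

# A six-hour TTL means up to six hours of downtime for the job and
# three compression runs executed back-to-back afterward.
assert worst_case_failover_hours(6) == 6
assert missed_runs_during_outage(6) == 3
```

So a longer TTL buys fewer LWTs at the cost of a proportionally longer stuck-lock window and a proportionally larger catch-up spike.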
The second question is what should happen if the lock expires. If we are running a single hawkular-metrics server, it is not really a problem. If multiple servers are running, though, we could wind up with multiple instances of the job executing concurrently. Note that this cannot happen within a single server: the scheduler keeps an in-memory cache of its actively running jobs and will not attempt to run a job that is already in the cache.
For the multi-server scenario, I think it would be good to provide some sort of notification to let the job know that the lock has expired. The job itself can then decide what to do.
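One possible shape for that notification, again as an illustrative in-memory sketch rather than a proposed API: the lock takes an `on_expired` callback (a hypothetical name) that fires when a renewal attempt finds the TTL has already lapsed, so the job can decide whether to abort or finish its current unit of work.

```python
class RenewableLock:
    """Sketch of a lock that notifies the job when its TTL lapses."""

    def __init__(self, ttl_seconds, on_expired):
        self.ttl = ttl_seconds
        self.on_expired = on_expired  # job-supplied expiry handler
        self.expires_at = None

    def acquire(self, now):
        self.expires_at = now + self.ttl

    def renew(self, now):
        """Extend the lock, or fire the expiry callback if we lost it."""
        if now >= self.expires_at:
            self.on_expired()
            return False
        self.expires_at = now + self.ttl
        return True

events = []
lock = RenewableLock(ttl_seconds=3600, on_expired=lambda: events.append("expired"))
lock.acquire(now=0)
assert lock.renew(now=1800)      # renewed in time; expiry pushed out
assert not lock.renew(now=9000)  # renewal came too late; job is notified
assert events == ["expired"]
```

Returning `False` from the late renewal gives the job an unambiguous signal that another server may now hold the lock, which is exactly the information it needs to decide what to do.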