Hawkular Metrics

Compression job can execute in loop when execution falls far behind


    • Type: Bug
    • Status: Closed (View Workflow)
    • Priority: Critical
    • Resolution: Done
    • Affects Version/s: 0.21.0
    • Fix Version/s: 0.21.5, 0.22.0
    • Component/s: Scheduler
    • Labels:


      I was doing some testing on my local dev box, on which the server had not run for more than 48 hours. When I started it up, the compression job resumed execution and ran continually, trying to catch up to its current scheduled execution. When the job finishes behind schedule, a row is inserted into the finished_jobs_idx table and its next execution starts immediately. The job scheduler then compares the rows in scheduled_jobs_idx with the rows in finished_jobs_idx for the given time slice to determine whether all jobs are finished. If they are, the partitions in those tables are deleted.
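The end-of-slice bookkeeping described above can be sketched as follows. This is a minimal illustration, not the actual Hawkular scheduler code; the class and method names are assumptions, and an in-memory set stands in for each index table partition.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch of the scheduler's end-of-time-slice check:
// a slice is complete only when every row in scheduled_jobs_idx has a
// matching row in finished_jobs_idx, at which point the partitions
// for that slice are deleted.
class TimeSliceIndex {
    // stand-ins for the scheduled_jobs_idx and finished_jobs_idx partitions
    final Set<String> scheduled = new HashSet<>();
    final Set<String> finished = new HashSet<>();

    boolean allJobsFinished() {
        return finished.containsAll(scheduled);
    }

    void deletePartitionsIfDone() {
        if (allJobsFinished()) {
            scheduled.clear(); // stands in for dropping the partitions
            finished.clear();
        }
    }
}
```

The bug arises because executions of a far-behind job can outpace this check: rows pile up in scheduled_jobs_idx faster than the comparison and deletion run.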

      Because the compression job was so far behind schedule, a whole lot of executions ran before the job scheduler had a chance to compare scheduled_jobs_idx and finished_jobs_idx. It reached a point where the job had caught up, the job lock was released, the index tables had not yet been updated, and the scheduler, looking for jobs to execute, found a number of rows in scheduled_jobs_idx for already completed executions. This put the job scheduler into a sort of loop in which it kept rescheduling the job with execution times for which the job had already run.

      When the job scheduler encounters a job whose status is already set to FINISHED, it needs to check whether the job is already scheduled for its next execution. If it is, no rescheduling should be performed.
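The proposed guard might look like the sketch below. This is an illustrative reading of the fix, not the patch that shipped in 0.21.5/0.22.0; the enum, method signature, and parameter names are assumptions.

```java
import java.time.Instant;
import java.util.Set;

// Hypothetical guard for the fix described above: when a job found in
// scheduled_jobs_idx is already FINISHED and a row for its next execution
// time already exists, skip rescheduling so the loop cannot occur.
enum JobStatus { SCHEDULED, RUNNING, FINISHED }

class RescheduleGuard {
    /**
     * @param status         status of the job found in scheduled_jobs_idx
     * @param nextExecution  the job's computed next execution time
     * @param scheduledTimes execution times already present in scheduled_jobs_idx
     * @return true only if the job still needs to be (re)scheduled
     */
    static boolean shouldReschedule(JobStatus status, Instant nextExecution,
                                    Set<Instant> scheduledTimes) {
        if (status == JobStatus.FINISHED && scheduledTimes.contains(nextExecution)) {
            return false; // already queued for its next run; do not reschedule
        }
        return true;
    }
}
```

With this check, stale FINISHED rows left over from the catch-up burst are ignored instead of being turned into duplicate executions.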

              • Assignee: john.sanda John Sanda
              • Votes: 0
              • Watchers: 1

                • Created: