-
Bug
-
Resolution: Done
-
Major
-
1.0.0.Final
-
None
When restarting a partitioned chunk job execution, org.jberet.runtime.runner.StepExecutionRunner#beginPartition calculates the number of partitions based on the unfinished partitions in the original step execution. This works in most cases, but not in the following case:
- the original job was abruptly terminated (aborted by the user pressing Ctrl-C, killed, or lost to a JVM, OS, or machine crash), and some partitions have no entry in the PARTITION_EXECUTION table. This is more likely to happen when the thread-count attribute is less than the number of partitions, so some partitions have to wait for an available thread. As a result, these unstarted partitions are forgotten in the restart.
Strictly speaking, in the above case the batch status of the original job execution will be STARTED, STARTING, or STOPPING, and the execution cannot be restarted. But to support restarting killed or crashed job executions, this will need to be fixed.
When a job execution fails or stops normally, the PARTITION_EXECUTION table should still have entries for all partitions.
In some cases, it is normal for the restart to have fewer partition executions than configured in the job xml. For example,
1, in the first run of a step with 5 partitions, p0 and p1 completed, p2, p3 and p4 failed, and the step and the job failed normally (not killed or crashed);
2-a, during restart, only 3 partitions need to run (p2, p3, p4). If p2 completed and p3, p4 failed, the next restart only needs to run p3 and p4.
Or,
2-b, during restart, p2, p3 and p4 are rerun; p2 completed, but p3 and p4 crashed. The next restart should ideally only need to rerun p3 and p4.
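The selection rule described above can be sketched as follows. This is only an illustration, not the actual JBeret code: the class RestartPartitionPlanner, its Status enum and the partitionsToRun method are hypothetical names, and the map stands in for the rows found in PARTITION_EXECUTION. The idea is to walk over the configured partition ids rather than over the recorded rows, so a partition with no row (never started before the crash) is still picked up.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of JBeret; it only illustrates the rule.
public class RestartPartitionPlanner {
    // Simplified stand-in for the batch status stored per partition.
    public enum Status { COMPLETED, FAILED, STOPPED, STARTED }

    /**
     * Returns the ids of the partitions a restart should run.
     * configuredCount: number of partitions configured for the step.
     * recorded: status per partition id, as found in PARTITION_EXECUTION;
     *           partitions that never started have no entry.
     */
    public static List<Integer> partitionsToRun(int configuredCount,
                                                Map<Integer, Status> recorded) {
        List<Integer> toRun = new ArrayList<>();
        for (int id = 0; id < configuredCount; id++) {
            // A missing entry (null) or any non-COMPLETED status means rerun.
            if (recorded.get(id) != Status.COMPLETED) {
                toRun.add(id);
            }
        }
        return toRun;
    }

    public static void main(String[] args) {
        // Normal failure: all 5 partitions have an entry, p0 and p1 completed.
        System.out.println(partitionsToRun(5, Map.of(
                0, Status.COMPLETED, 1, Status.COMPLETED,
                2, Status.FAILED, 3, Status.FAILED, 4, Status.FAILED)));
        // prints [2, 3, 4]

        // Crash with fewer threads than partitions: p2..p4 never got an entry.
        System.out.println(partitionsToRun(5, Map.of(
                0, Status.COMPLETED, 1, Status.STARTED)));
        // prints [1, 2, 3, 4]
    }
}
```

Counting only the unfinished rows in the second scenario would yield a single partition (p1), which is the behaviour this issue reports.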
- relates to
-
JBERET-154 Support restarting killed or crashed job execution
- Resolved