Feature Request
Resolution: Unresolved
Major
2.2
Feature Overview
- Start a job.
- Kill the controller that is controlling the job (not the node executing it).
- The job does not fail; the next available controller takes it over.
Background, and strategic fit
Increased reliability of the cluster: fewer jobs fail because of the platform itself.
The customer expects this feature as part of replacing AWX with AAP.
(Optional) Use Cases
n/a
Assumptions
- The controller is not a hybrid node, so it does not run the job itself.
- The job explanation "Task was marked as running but was not present in the job queue, so it has been marked as failed" suggests that the logic to detect such cases already exists. It could be extended to put the "interrupted" job back in the queue, which would require a new job status, so that another controller can take control of it.
- This assumes that all necessary information is available in the database, and that the execution host accepts a controller other than the original one taking control of an already running job.
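The assumptions above can be sketched as follows. This is a minimal illustration only, not AWX or AAP code: the `Job` record, the `"interrupted"` status, and the functions `reap_or_requeue` and `adopt_interrupted` are all hypothetical names invented for this example. It assumes the set of live controllers and live execution nodes can be determined from the database.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional, Set


@dataclass
class Job:
    """Hypothetical job record; field names are assumptions, not the AWX schema."""
    job_id: int
    status: str                     # e.g. "running", "interrupted", "failed"
    controller_node: Optional[str]  # node controlling the job
    execution_node: str             # node actually executing the job


def reap_or_requeue(jobs: Iterable[Job],
                    live_controllers: Set[str],
                    live_execution_nodes: Set[str]) -> List[Job]:
    """Handle running jobs whose controller has died.

    If the execution node is still alive, the job is marked "interrupted"
    (the new status this request proposes) instead of "failed".
    If the execution node is gone too, the job is failed, since surviving
    the death of the execution host is out of scope.
    Returns the jobs whose status changed.
    """
    changed = []
    for job in jobs:
        if job.status != "running" or job.controller_node in live_controllers:
            continue  # not running, or its controller is still alive
        if job.execution_node in live_execution_nodes:
            job.status = "interrupted"
            job.controller_node = None
        else:
            job.status = "failed"
        changed.append(job)
    return changed


def adopt_interrupted(jobs: Iterable[Job], controller: str) -> List[Job]:
    """Let a surviving controller take over every interrupted job."""
    adopted = []
    for job in jobs:
        if job.status == "interrupted":
            job.status = "running"
            job.controller_node = controller
            adopted.append(job)
    return adopted
```

In this sketch a surviving controller would call `reap_or_requeue` periodically and then `adopt_interrupted` with its own name. A real implementation would also need locking so that two controllers cannot adopt the same job.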
Out of Scope
- Making a job survive the death of the execution host.
- is cloned by: AAPRFE-583 Job can survive the death of the Mesh path used to control (Backlog)
- relates to: AAPRFE-186 Better failure handling of running jobs on execution nodes (Backlog)