Feature Request
Resolution: Unresolved
2.3
What is the nature and description of the request?
There are moments when the podman configuration on a hybrid Automation Controller or Execution Node becomes corrupted. The causes vary and are not easily reproducible. When it does happen, the system administrator must intervene, usually by logging on to the affected server and running podman system reset.
I would like to propose a way for Ansible Automation Platform to perform periodic checks (more than just a heartbeat check) to see whether an Execution Node is still able to execute podman commands. If it detects any issue with running a podman command, Automation Controller would do the following:
- Disable the affected Execution Node (i.e. prevent jobs from running on it)
- Run the podman system reset command
- Run a simple command on the Execution Node to verify that it is able to execute jobs
- Re-enable the Execution Node
Think of this RFE as more of a self-healing feature.
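The steps above can be sketched roughly as follows. This is only an illustration of the proposed flow, not how Automation Controller works today. It assumes the check runs on the affected node itself, uses `podman info` as the "simple command" for both detection and verification, and takes hypothetical `disable` and `enable` callbacks that stand in for whatever mechanism the controller would use to take the node out of and back into rotation. The names `run_local`, `podman_healthy`, and `self_heal` are made up for this example.

```python
import subprocess
from typing import Callable, Sequence

# A runner takes an argv list and returns the process exit code.
# Injecting it keeps the flow testable without a real podman install.
Runner = Callable[[Sequence[str]], int]


def run_local(argv: Sequence[str]) -> int:
    """Run a command on the local host and return its exit code."""
    try:
        return subprocess.run(argv, capture_output=True, timeout=120).returncode
    except (OSError, subprocess.TimeoutExpired):
        return 1  # a missing binary or a hang counts as a failure


def podman_healthy(run: Runner = run_local) -> bool:
    """Assumption: a successful `podman info` means podman is usable."""
    return run(["podman", "info"]) == 0


def self_heal(
    disable: Callable[[], None],  # hypothetical: stop scheduling jobs on the node
    enable: Callable[[], None],   # hypothetical: allow jobs on the node again
    run: Runner = run_local,
) -> str:
    """Return 'healthy', 'recovered', or 'failed'."""
    if podman_healthy(run):
        return "healthy"
    disable()                                      # step 1: disable the node
    run(["podman", "system", "reset", "--force"])  # step 2: reset podman
    if podman_healthy(run):                        # step 3: verify
        enable()                                   # step 4: re-enable the node
        return "recovered"
    return "failed"  # node stays disabled for manual follow-up
```

Note that `podman system reset` removes all containers, images, and volumes on the node, so any real implementation would probably need to account for execution environment images being pulled again afterwards. Leaving the node disabled when verification fails is a design choice in this sketch; it keeps jobs off a node that is still broken.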
Why does the customer need this? (List the business requirements here)
Some important jobs run during the night, and if podman becomes corrupted on an Execution Node, those jobs might not run. The system administrator would then need to take action outside of business hours.
This would save the customer time and money. In addition, it would give the customer confidence that the application is able to correct itself when an issue occurs.
How would you like to achieve this? (List the functional requirements here)
See the description above for a possible set of steps to achieve this. There might be better ways to go about it.
List any affected known dependencies: Doc, UI, etc.
- Podman
- Receptor
Github Links