Feature
Resolution: Unresolved
PIDONE Board, PIDONE 18.0.4
2024Q1, 2024Q2
Proposed high-level workflows
Operators
day1
- deploy ihanext via k8s primitives or operator (preferred)
- optionally add 'evacuable' metadata to flavors/images/aggregates (configurable)
day2
- if a compute node needs maintenance, simply disable it via the nova API
- temporarily disabling ihanext: scale replicas to zero
- removing ihanext: scale to zero, then delete the k8s objects (deployment, secret, etc.)
IHANext:
- queries the nova API for compute-node status every 30s (configurable)
- filters out disabled hosts, on the assumption that operators disabled them for maintenance
- looks for hosts that are not disabled and that have not reported their status for 30s (configurable)
- checks how many compute nodes are impacted. If more than half of them are in trouble, it does not evacuate, because the failure is assumed to be a disaster-type scenario.
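The detection steps above can be sketched in Python. This is only an illustration under stated assumptions, not the IHANext implementation: the record keys (`host`, `status`, `updated_at`) and the function names are hypothetical, and "more than half" is assumed to be measured against all compute hosts returned by the query.

```python
from datetime import datetime, timedelta, timezone


def find_failed_hosts(services, now, stale_after=timedelta(seconds=30)):
    """Return enabled hosts whose last status report is older than stale_after.

    `services` is a list of dicts with hypothetical keys:
    "host", "status" ("enabled"/"disabled") and "updated_at" (datetime).
    """
    failed = []
    for svc in services:
        if svc["status"] == "disabled":
            # Assumed to be disabled by an operator for maintenance.
            continue
        if now - svc["updated_at"] > stale_after:
            failed.append(svc["host"])
    return failed


def should_evacuate(failed_hosts, total_hosts):
    """Evacuate only if something failed and no more than half of the hosts are hit."""
    return bool(failed_hosts) and len(failed_hosts) * 2 <= total_hosts


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    services = [
        {"host": "compute-0", "status": "enabled", "updated_at": now},
        {"host": "compute-1", "status": "enabled",
         "updated_at": now - timedelta(seconds=45)},
        {"host": "compute-2", "status": "disabled",
         "updated_at": now - timedelta(minutes=5)},
    ]
    failed = find_failed_hosts(services, now)
    print(failed, should_evacuate(failed, len(services)))
```

Keeping the two checks as pure functions makes the 30s threshold and the "more than half" rule easy to test without a running nova API.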
Evacuation workflow:
- verifies whether the compute host has workloads that need to be evacuated (by checking the 'evacuable' metadata)
- performs IPMI-based fencing (off/on). fencing.yaml needs to be populated with the ip/port/user/pass details for each compute. See the example in this repo, as the format changed.
- calls nova to mark the host as force_down and disables it explicitly, adding a meaningful message and a timestamp in the "Disable Reason" field
- performs evacuation of workloads
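The ordering of the workflow above can be sketched as follows. Everything here is hypothetical: `fence`, `force_down`, `disable` and `evacuate` are injected callables standing in for the IPMI and nova calls, the wording of the disable reason is invented for illustration, and returning early when nothing is evacuable is an assumption rather than confirmed behaviour.

```python
from datetime import datetime, timezone


def build_disable_reason(now):
    """Hypothetical message format: free text plus an ISO 8601 timestamp."""
    return f"IHANext: host fenced and evacuated at {now.isoformat()}"


def run_evacuation(host, servers, fence, force_down, disable, evacuate, now=None):
    """Run the listed steps in order for one failed host.

    `servers` is a list of dicts with hypothetical keys "id" and "evacuable".
    Returns the ids of the servers for which evacuation was requested.
    """
    now = now or datetime.now(timezone.utc)

    # 1. Check which workloads are tagged as evacuable.
    targets = [s["id"] for s in servers if s.get("evacuable")]
    if not targets:
        return []  # assumption: nothing to do, so the host is left alone

    # 2. Fence the host (the callable hides the IPMI off/on details).
    fence(host)

    # 3. Mark the host as forced down and disable it with a reason.
    force_down(host)
    disable(host, build_disable_reason(now))

    # 4. Evacuate the tagged workloads.
    for server_id in targets:
        evacuate(server_id)
    return targets
```

Fencing comes before force_down and evacuation, so the instances are not still running on the failed host when they are rebuilt elsewhere.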
Evacuation modes (configurable):
- "fire and forget" mode: feature parity with current IHA implementation
- "smart evacuation" mode: evacuates up to X VMs from Y computes in parallel and polls the status of each individual evacuation. Both X and Y will be configurable.
- retries a failed evacuation up to 5 times (configurable) and then gives up.
- the host will not be re-enabled if any of the evacuations failed, and no further evacuations will be attempted.
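The "smart evacuation" mode can be sketched for a single host as follows. This is a minimal illustration, not the actual implementation: `evacuate_once` is a hypothetical callable that starts one evacuation, polls it to completion and returns True on success; "up to 5 times" is read as 5 attempts in total; and parallelism across Y computes is left out for brevity.

```python
from concurrent.futures import ThreadPoolExecutor


def evacuate_with_retry(vm, evacuate_once, max_attempts=5):
    """Try one VM up to max_attempts times; return True as soon as one attempt succeeds."""
    for _ in range(max_attempts):
        if evacuate_once(vm):
            return True
    return False  # gave up


def smart_evacuate(vms, evacuate_once, max_parallel=2, max_attempts=5):
    """Evacuate at most max_parallel VMs at a time (the X in the text)."""
    with ThreadPoolExecutor(max_workers=max_parallel) as pool:
        results = list(
            pool.map(lambda vm: evacuate_with_retry(vm, evacuate_once, max_attempts), vms)
        )
    failed = [vm for vm, ok in zip(vms, results) if not ok]
    # The host is only a candidate for re-enabling if every evacuation succeeded.
    return {"failed": failed, "reenable_host": not failed}
```

A thread pool bounds the number of concurrent evacuations with little code, and the work is I/O-bound (API calls and polling), so threads are adequate here.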