Proposed high level workflows

deploy ihanext via k8s primitives or operator (preferred)
eventually add 'evacuable' metadata to flavors/images/aggregates (optional, configurable)

queries nova api for compute-node status every 30s (configurable)
filters out disabled hosts as it is assumed operators would have disabled them for maintenance purposes
looks for hosts that are not disabled and that did not report their status for 30s (configurable)
checks how many of the compute nodes are impacted. If more than half of them are failing, it does not evacuate anything, because the failure is assumed to be a disaster-type scenario.
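The detection steps above can be sketched roughly as follows. This is a minimal illustration, not the actual ihanext code: the function name, the dict layout of a service record, and the choice to count "more than half" against all compute nodes (including disabled ones) are assumptions made for the example.

```python
from datetime import datetime, timedelta, timezone


def find_hosts_to_evacuate(services, now,
                           stale_after=timedelta(seconds=30),
                           max_failed_ratio=0.5):
    """Return the hosts that look failed, or [] in a disaster-type scenario.

    `services` is a hypothetical list of dicts with the keys
    'host', 'status' ('enabled' / 'disabled') and 'updated_at'
    (timezone-aware datetime of the last status report).
    """
    # Skip disabled hosts (assumed to be under maintenance) and keep the
    # ones that have not reported their status within `stale_after`.
    stale = [
        s["host"]
        for s in services
        if s["status"] != "disabled" and now - s["updated_at"] > stale_after
    ]
    # Assumption: the threshold is measured against all compute nodes.
    if len(stale) > len(services) * max_failed_ratio:
        return []
    return stale
```

In a real deployment the records would come from the nova API on each 30s (configurable) poll; here they are plain dicts so the logic can be read and tested on its own.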

verifies whether the compute host runs workloads that need to be evacuated (by checking the 'evacuable' metadata)
performs IPMI-based fencing (off/on). fencing.yaml must be populated with ip/port/user/pass details for each compute node. See the example in this repo, as the format has changed.
calls nova to mark the host as force_down and disables it explicitly, adding a meaningful message and a timestamp in the "Disable Reason" field
performs evacuation of workloads
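The per-host steps above could be ordered along these lines. This is a sketch under stated assumptions: `fence`, `force_down_and_disable` and `evacuate` are hypothetical callables standing in for the IPMI and nova calls, the 'evacuable' tag is modelled as a plain metadata key on each server dict, and skipping a host with no evacuable workloads is an assumed behaviour.

```python
from datetime import datetime, timezone


def is_evacuable(server, tagging_enabled=True):
    """With tagging disabled every server qualifies; otherwise require the tag."""
    if not tagging_enabled:
        return True
    value = server.get("metadata", {}).get("evacuable", "")
    return str(value).lower() == "true"


def disable_reason(host, now):
    """Build a message plus timestamp for the "Disable Reason" field."""
    stamp = now.isoformat(timespec="seconds")
    return f"evacuation: {host} fenced and forced down at {stamp}"


def handle_failed_host(host, servers, fence, force_down_and_disable,
                       evacuate, now, tagging_enabled=True):
    """Run the fence -> force_down/disable -> evacuate sequence for one host."""
    targets = [s for s in servers if is_evacuable(s, tagging_enabled)]
    if not targets:
        # Assumption: nothing to evacuate, so the host is left untouched.
        return []
    fence(host)  # hypothetical wrapper around the IPMI off/on step
    force_down_and_disable(host, disable_reason(host, now))
    for server in targets:
        evacuate(server)
    return [s["id"] for s in targets]
```

Passing the side-effecting steps in as callables keeps the ordering visible and lets the sequence be exercised without a live cloud.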

"fire and forget" mode: feature parity with current IHA implementation
"smart evacuation": evacuates up to X VMs from Y computes in parallel and polls each individual evacuation status. Both X and Y will be configurable.
- retries each evacuation up to 5 times (configurable) before giving up.
- host will not be re-enabled if any of the evacuations failed. No further evacuations will be attempted.
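A rough sketch of the "smart evacuation" mode for a single compute host is shown below. It assumes X is a per-host limit on parallel evacuations and that `evacuate_once` is a hypothetical callable that triggers one evacuation, polls it to completion, and returns True on success. Parallelism across Y computes would need a similar outer pool, which is omitted here.

```python
from concurrent.futures import ThreadPoolExecutor


def evacuate_with_retries(server_id, evacuate_once, max_attempts=5):
    """Try one server up to `max_attempts` times, then give up."""
    for _ in range(max_attempts):
        if evacuate_once(server_id):
            return True
    return False


def smart_evacuate(host, server_ids, evacuate_once,
                   max_parallel_vms=4, max_attempts=5):
    """Evacuate up to `max_parallel_vms` servers of one host at a time."""
    with ThreadPoolExecutor(max_workers=max_parallel_vms) as pool:
        results = list(pool.map(
            lambda sid: evacuate_with_retries(sid, evacuate_once, max_attempts),
            server_ids,
        ))
    failed = [sid for sid, ok in zip(server_ids, results) if not ok]
    # The host is only a candidate for re-enabling if every evacuation succeeded.
    return {"host": host, "failed": failed, "reenable_host": not failed}
```

The returned `reenable_host` flag mirrors the rule above: a single failed evacuation keeps the host disabled.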