  1. Red Hat OpenStack Services on OpenShift
  2. OSPRH-3351

Investigate InstanceHA in nextgen



      Proposed high level workflows

      Operators

      day1
      • deploy ihanext via k8s primitives or operator (preferred)
      • add 'evacuable' metadata to flavors/images/aggregates if needed (optional, configurable)
      day2
      • if a compute node needs maintenance, simply disable it via the Nova API
      • temporarily disabling ihanext: scale replicas to zero
      • removing ihanext: scale to zero, delete k8s objects (deployment, secret, etc)

      IHANext:

      • queries the Nova API for compute-node status every 30s (configurable)
      • filters out disabled hosts, as it is assumed that operators disabled them for maintenance purposes
      • looks for hosts that are not disabled and that have not reported their status for 30s (configurable)
      • checks how many of the compute nodes are impacted. If more than half of them are experiencing trouble, it will not evacuate, as the failure scenario is assumed to be of the disaster type.
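
      The checks above can be sketched as a small filter. This is only an illustration, not the ihanext code: the function name, the dict keys ('host', 'status', 'updated_at') and the choice to count "more than half" against all compute nodes, including disabled ones, are assumptions.

```python
from datetime import datetime, timedelta, timezone


def find_hosts_to_evacuate(services, now, stale_after=timedelta(seconds=30)):
    """Return hosts that look failed, or [] if the outage looks like a disaster.

    `services` is a list of dicts with hypothetical keys:
    'host', 'status' ('enabled' or 'disabled') and 'updated_at' (aware datetime).
    """
    # Disabled hosts are assumed to be under maintenance, so they are skipped.
    enabled = [s for s in services if s["status"] != "disabled"]
    # An enabled host is suspect when its last report is older than the threshold.
    stale = [s["host"] for s in enabled if now - s["updated_at"] > stale_after]
    # Assumption: "more than half" is measured against all compute nodes.
    if len(stale) * 2 > len(services):
        return []
    return stale
```

      Passing `now` explicitly keeps the function deterministic, which makes the threshold logic easy to test without a running cloud.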
      Evacuation workflow:
      1. verifies whether the compute host runs workloads that need to be evacuated (checking the 'evacuable' metadata)
      2. performs IPMI-based fencing (off/on). fencing.yaml needs to be populated with ip/port/user/pass details for each compute. See the example in this repo, as the format changed.
      3. calls Nova to mark the host as force_down and disables it explicitly, adding a meaningful message and a timestamp in the "Disable Reason" field
      4. performs evacuation of workloads
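
      The ordering of the four steps can be sketched as follows. This is a minimal illustration under stated assumptions: `fence`, `force_down`, `disable` and `evacuate` are hypothetical callables standing in for the IPMI and Nova calls, the 'evacuable' key on each server dict is a placeholder for the metadata lookup, the disable-reason wording is made up, and skipping the host when nothing is evacuable is an assumed reading of step 1.

```python
from datetime import datetime, timezone


def evacuate_host(host, servers, fence, force_down, disable, evacuate, now=None):
    """Run the workflow steps in order; return the ids of the evacuated servers."""
    now = now or datetime.now(timezone.utc)
    # Step 1: keep only workloads flagged as evacuable (hypothetical dict key).
    targets = [s["id"] for s in servers if s.get("evacuable")]
    if not targets:
        # Assumption: nothing to evacuate means the host is left untouched.
        return []
    # Step 2: fence the host; the off/on sequence is left to the callable.
    fence(host)
    # Step 3: mark the host as forced down, then disable it with a reason
    # that carries a message and a timestamp (wording is illustrative).
    force_down(host)
    disable(host, f"ihanext evacuation started at {now.isoformat()}")
    # Step 4: evacuate each selected workload.
    for server_id in targets:
        evacuate(server_id)
    return targets
```

      Injecting the callables keeps the ordering logic separate from the IPMI and Nova clients, so either can be replaced without touching the workflow.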
      Evacuation modes (configurable):
      • "fire and forget" mode: feature parity with current IHA implementation
      • "smart evacuation": evacuates up to X VMs from Y computes in parallel and polls each individual evacuation status. Both X and Y will be configurable.
        • will re-try the evacuation up to 5 times (configurable) and eventually give up.
        • host will not be re-enabled if any of the evacuations failed. No further evacuations will be attempted.
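
      One possible shape for the "smart evacuation" mode, using bounded thread pools for the X and Y limits. This is a sketch, not the proposed implementation: `evacuate_and_poll` is a hypothetical callable that starts one evacuation, polls it and returns True on success, and counting the first try as one of the 5 attempts is an assumption.

```python
from concurrent.futures import ThreadPoolExecutor


def evacuate_with_retry(server_id, evacuate_and_poll, max_attempts=5):
    """Attempt one evacuation up to max_attempts times; True on success."""
    for _ in range(max_attempts):
        if evacuate_and_poll(server_id):
            return True
    return False


def smart_evacuate(hosts, evacuate_and_poll, max_vms=2, max_hosts=2, max_attempts=5):
    """Evacuate up to max_vms servers from up to max_hosts hosts in parallel.

    `hosts` maps a host name to its list of server ids. The result maps each
    host to True only if every evacuation succeeded, i.e. whether the host
    may be considered for re-enabling.
    """
    def evacuate_one_host(server_ids):
        # Inner pool: at most max_vms evacuations in flight per host.
        with ThreadPoolExecutor(max_workers=max_vms) as pool:
            results = list(pool.map(
                lambda sid: evacuate_with_retry(sid, evacuate_and_poll, max_attempts),
                server_ids,
            ))
        return all(results)

    # Outer pool: at most max_hosts hosts processed at the same time.
    with ThreadPoolExecutor(max_workers=max_hosts) as pool:
        outcomes = list(pool.map(evacuate_one_host, hosts.values()))
    return dict(zip(hosts.keys(), outcomes))
```

      The sketch only reports a per-host flag; leaving a host disabled when its flag is False is up to the caller.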

              rhn-support-lmiccini Luca Miccini
              rhos-dfg-pidone