Red Hat OpenStack Services on OpenShift: OSPRH-10339

TRAC blocker: It takes a long time for the Galera cluster to recover after the disruptive action on all its nodes.


    • Important

      What is the probability and severity of the issue? I.e. the overall risk
      Medium probability, high impact.
      When galera pods are temporarily disconnected from the Galera cluster
      (moving to a non-primary partition), the pod's health probe never detects it,
      so OpenShift can never restart them automatically.
      This can lead to a service outage that is difficult for the customer to troubleshoot
      and can only be fixed by manual intervention.

      The fixed health probe is already implemented and merged upstream in main.

      Getting that fixed will resolve both OSPRH-8862 and OSPRH-5705.
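
As an illustration of the kind of check described above, here is a minimal sketch of a probe decision based on Galera's `wsrep_cluster_status` status variable, which reports `Primary` when the node belongs to the primary partition. This is not the upstream fix: the function names and the way the status rows are obtained are assumptions made for this example.

```python
# Hypothetical sketch: decide whether a Galera node should pass its health probe.
# The rows are assumed to come from: SHOW GLOBAL STATUS LIKE 'wsrep_%';


def parse_wsrep_status(rows):
    """Turn (Variable_name, Value) rows into a dict."""
    return {name: value for name, value in rows}


def partition_is_primary(status):
    """Return True only when the node reports it is in the primary partition.

    A missing variable is treated as unhealthy, so a node that cannot
    report its cluster status fails the probe instead of passing silently.
    """
    return status.get("wsrep_cluster_status") == "Primary"


if __name__ == "__main__":
    # Sample rows for a node that lost quorum (illustrative values).
    rows = [
        ("wsrep_cluster_status", "non-Primary"),
        ("wsrep_connected", "ON"),
    ]
    healthy = partition_is_primary(parse_wsrep_status(rows))
    # A probe script would exit non-zero here so the pod can be restarted.
    raise SystemExit(0 if healthy else 1)
```

A real probe would likely also need to tolerate nodes that are still joining or acting as a state-transfer donor, which this sketch does not attempt.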

      Does this affect specific configurations, hardware, environmental factors, etc.?
      All deployments are impacted.

      Are any partners relying on this functionality in order to ship an ecosystem product?
      No.

      What proportion of our customers could hit this issue?
      All of them could hit it. Any network disruption in the customer's environment could
      trigger it.

      Does this happen for only a specific use case?
      No, all OpenStack control plane workloads can be impacted.

      What proportion of our CI infrastructure, automation, and test cases does this issue impact?
      No impact upstream, as the fix was merged post 18.0.1.

      Is this a regression in supported functionality from a previous release?
      Yes, as the automatic restart provided by HA does not work correctly.

      Is there a clear workaround?
      The workaround consists of force-restarting the galera pods manually, which requires
      customer knowledge and intervention.

      Is there potential doc impact?
      No doc impact because this is nothing more than a bug that should be addressed.

      If this is a UI issue:
      Is the UI still fit for its purpose/goal?
      N/A

      Does the bug compromise the overall trustworthiness of the UI?
      N/A

      Overall context and effort – is the overall impact bigger/worse than the bug in isolation? For example, 1 workaround might seem ok, 5 is getting ugly, 20 might be unacceptable (rough numbers).
      1

            Unassigned
            rhn-engineering-dciabrin Damien Ciabrini
            rhos-dfg-pidone