OSPRH-17604: galera endpoint failover can take a long time during OCP worker outage

Project: Red Hat OpenStack Services on OpenShift

    • Fix version: mariadb-operator-container-1.0.12-2
    • Team: rhos-ops-platform-services-pidone
    • Sprint: Sprint 2, Sprint 3
    • Severity: Important

      Description

      Observed during QA testing in our lab.

      When we forcibly destroy an OCP worker/master VM hosting the galera pod that is currently selected as the database traffic endpoint, the other galera pods correctly detect that the pod went away, and a script is correctly executed on one of the surviving pods to update the traffic endpoint. This works as expected.

      The pod configuring the new endpoint uses curl to call the API server and update the selector of the service object responsible for balancing database traffic.
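
      For illustration, the update amounts to an in-cluster API call of roughly this shape (a sketch, not the exact script shipped in the operator; the service name and selector keys are taken from the reproduction steps below, and the namespace is assumed to be openstack):
      ```
      # In-cluster PATCH of the Service selector, using the pod's service account.
      # The token/CA paths and KUBERNETES_* env vars are the standard in-cluster
      # defaults; openstack-galera-0 stands in for whichever pod survived.
      TOKEN=$(cat /var/run/secrets/kubernetes.io/serviceaccount/token)
      CACERT=/var/run/secrets/kubernetes.io/serviceaccount/ca.crt
      curl -sS --cacert "$CACERT" \
        -H "Authorization: Bearer $TOKEN" \
        -H "Content-Type: application/merge-patch+json" \
        -X PATCH \
        "https://${KUBERNETES_SERVICE_HOST}:${KUBERNETES_SERVICE_PORT}/api/v1/namespaces/openstack/services/openstack" \
        -d '{"spec":{"selector":{"app":"galera","cr":"galera-openstack","statefulset.kubernetes.io/pod-name":"openstack-galera-0"}}}'
      ```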

      If the API server VIP is hosted on the node that was forcibly destroyed, the VIP is reassigned automatically by OCP, but if the reassignment happens while curl is running, curl can be left stuck for a very long time trying to establish a connection to a stopped node. During this time, the service object cannot be updated, and database traffic is unavailable, even though galera has recovered almost instantly from the loss of a cluster member.
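
      The stall happens in the TCP connect phase: SYNs sent toward the dead node are never answered, and with default Linux settings the kernel keeps retrying for roughly two minutes per attempt. One plausible way to bound the stall (a mitigation sketch, not necessarily the shipped fix) is to cap and retry the call with curl's own timers (--retry-all-errors needs curl >= 7.71):
      ```
      # Fail a connect attempt after 5s and the whole request after 30s,
      # then retry, so an attempt made after the VIP has moved can succeed.
      # TOKEN and CACERT as in the previous sketch.
      curl -sS --connect-timeout 5 --max-time 30 --retry 5 --retry-all-errors \
        --cacert "$CACERT" -H "Authorization: Bearer $TOKEN" \
        -H "Content-Type: application/merge-patch+json" -X PATCH \
        "https://${KUBERNETES_SERVICE_HOST}:${KUBERNETES_SERVICE_PORT}/api/v1/namespaces/openstack/services/openstack" \
        -d '{"spec":{"selector":{"app":"galera","cr":"galera-openstack","statefulset.kubernetes.io/pod-name":"openstack-galera-0"}}}'
      ```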

      To Reproduce

      Steps to reproduce the behavior:

      1. Deploy a RHOSO control plane on OCP
      2. Look for the current database endpoint
        ```
        $ oc get svc openstack -o wide
        NAME        TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)    AGE    SELECTOR
        openstack   ClusterIP   172.30.83.191   <none>        3306/TCP   3d5h   app=galera,cr=galera-openstack,statefulset.kubernetes.io/pod-name=openstack-galera-1
        ```
      3. Look for the node hosting the endpoint pod
        ```
        $ oc get pod -l galera/name=openstack -o wide
        NAME                 READY   STATUS    RESTARTS   AGE   IP               NODE       NOMINATED NODE   READINESS GATES
        openstack-galera-0   1/1     Running   0          18h   192.168.20.105   master-2   <none>           <none>
        openstack-galera-1   1/1     Running   0          18h   192.168.16.128   master-1   <none>           <none>
        openstack-galera-2   1/1     Running   1          18h   192.168.24.26    master-0   <none>           <none>
        ```
      4. Crash the node on the hypervisor
        ```
        $ virsh destroy cifmw-ocp-master-1
        ```
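
      To observe the failover, the selector can be polled while the node is down (an observation aid, not part of the original report; the SELECTOR column shows which pod currently receives traffic):
        ```
        $ watch -n1 "oc get svc openstack -o wide"
        ```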

      Expected behavior

      • The selector in the service object should be updated to a surviving pod quickly, on the same timescale as galera's own recovery.

      Bug impact

      • Without a fix, the recovery of the database traffic can take an unacceptably long time.

      Known workaround

      • No automatic workaround. A manual recovery sketch is shown below.
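
      In the meantime, traffic should be recoverable by manually repointing the selector at a surviving pod (a hypothetical manual step, not from the original report; substitute whichever galera pod is still running):
      ```
      $ oc patch svc openstack --type merge \
          -p '{"spec":{"selector":{"app":"galera","cr":"galera-openstack","statefulset.kubernetes.io/pod-name":"openstack-galera-0"}}}'
      ```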

      Assignee: Damien Ciabrini (rhn-engineering-dciabrin)
      Reporter: Damien Ciabrini (rhn-engineering-dciabrin)
      Group: rhos-dfg-pidone