Project: Red Hat OpenStack Services on OpenShift
Issue: OSPRH-18408

galera endpoint failover might rely on stale endpoints data during OCP worker outage


    • Type: Bug
    • Resolution: Done-Errata
    • Priority: Major
    • Fix versions: rhos-18.0.11, rhos-18.0.8
    • Component: mariadb-operator
    • Fixed in build: mariadb-operator-container-1.0.13-2
    • Team: rhos-ops-platform-services-pidone
    • Sprint: Sprint 3, Sprint 4
    • Severity: Important

      This is a follow-up to OSPRH-17604, observed during QA testing in our lab.

      The operator script that implements service endpoint failover contains internal logic to probe the up-to-date state of the gcomm cluster. This probe runs when the script starts, or when a command fails and is retried.

      The list of members is extracted from a mysql table that is not guaranteed to be up to date when, for example, a node disappears from the cluster due to a network partition. Consequently, the failover heuristic can sometimes choose a node that is no longer part of the galera cluster, which leads to a long service outage.
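      For context, the live gcomm membership is reported by Galera status variables such as `wsrep_incoming_addresses`. A minimal sketch of extracting member IPs from such a value (the sample string is hardcoded here, since the real query needs a running cluster):

      ```
      # Hypothetical sample value, as
      #   mysql -e "SHOW GLOBAL STATUS LIKE 'wsrep_incoming_addresses'"
      # might report it on a healthy 3-node cluster
      addresses="192.168.20.105:3306,192.168.16.128:3306,192.168.24.26:3306"

      # Split the comma-separated list and strip the :port suffix from each entry
      members=""
      for entry in ${addresses//,/ }; do
          members="$members ${entry%:*}"
      done
      members="${members# }"    # drop the leading space
      echo "current members: $members"
      ```

      Unlike a table persisted in mysql, this status variable tracks the cluster view directly, which is why stale rows and live membership can disagree after a partition.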
          

      Steps to reproduce the behavior:

      1. Deploy a RHOSO control plane on OCP
      2. Look for the current database endpoint
        ```
        $ oc get svc openstack -o wide
        NAME        TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)    AGE    SELECTOR
        openstack   ClusterIP   172.30.83.191   <none>        3306/TCP   3d5h   app=galera,cr=galera-openstack,statefulset.kubernetes.io/pod-name=openstack-galera-1
        ```
      3. Look for the node hosting the endpoint pod
        ```
        $ oc get pod -l galera/name=openstack -o wide
        NAME                 READY   STATUS    RESTARTS   AGE   IP               NODE       NOMINATED NODE   READINESS GATES
        openstack-galera-0   1/1     Running   0          18h   192.168.20.105   master-2   <none>           <none>
        openstack-galera-1   1/1     Running   0          18h   192.168.16.128   master-1   <none>           <none>
        openstack-galera-2   1/1     Running   1          18h   192.168.24.26    master-0   <none>           <none>
        ```
      4. Crash the node hosting the endpoint pod on the hypervisor
        ```
        $ virsh destroy cifmw-ocp-master-1
        ```
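      Note in the step 2 output that the ClusterIP service pins all database traffic to a single pod through the `statefulset.kubernetes.io/pod-name` label in its selector. A small sketch, using the selector string copied from that output, of extracting the pinned pod name:

      ```
      # Selector string as shown in the SELECTOR column of 'oc get svc openstack -o wide'
      selector="app=galera,cr=galera-openstack,statefulset.kubernetes.io/pod-name=openstack-galera-1"

      # Everything after 'pod-name=' is the pod currently serving the endpoint
      pod="${selector##*pod-name=}"
      echo "traffic is pinned to pod: $pod"
      ```

      Crashing master-1 therefore takes down exactly the pod the service points at, which is what forces the operator's failover logic to run.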

      Expected behavior

      • the selector in the service object should be updated to select a surviving pod shortly after the node goes down

      Bug impact

      • Without a fix, recovery of the database traffic can take a long time, because the service can keep pointing at a galera member that is no longer part of the cluster.
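      The underlying failure mode can be sketched in plain shell (no cluster required; the member lists and the picking rule are illustrative, not the operator's actual code):

      ```
      # Illustrative member lists, as the failover script might see them after
      # master-1 crashed (names follow the pod list from the reproduction steps)
      stale_members="openstack-galera-0 openstack-galera-1 openstack-galera-2"  # from the lagging mysql table
      live_members="openstack-galera-0 openstack-galera-2"                      # actual gcomm membership

      # A naive heuristic that trusts the stale list, e.g. picking the second member
      pick_endpoint() { set -- $1; echo "$2"; }

      choice=$(pick_endpoint "$stale_members")

      # Check the choice against the live membership
      status="ok"
      case " $live_members " in
        *" $choice "*) ;;
        *) status="stale-choice" ;;
      esac
      echo "chosen endpoint: $choice ($status)"
      ```

      Here the heuristic selects openstack-galera-1, the very member that just left the cluster, so the repointed service endpoint stays unreachable until the stale data is refreshed.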

      Known workaround

      • No automatic workaround.

              Assignee / Reporter: Damien Ciabrini (rhn-engineering-dciabrin)
              Group: rhos-dfg-pidone