Red Hat OpenStack Services on OpenShift / OSPRH-23075

mariadb-operator stops reconciling galera pods, leaving restarted pods in a wait state


    • Type: Bug
    • Resolution: Unresolved
    • Priority: Major
    • None
    • rhos-18.0.6
    • mariadb-operator
    • None
    • Sprint 11, Sprint 12
    • 2
    • Important

      Description

      Witnessed in a customer environment running a post-FR2 version of the mariadb-operator.

      The pods of the two galera CRs were restarted after an environment issue, which meant the respective galera clusters stopped and needed to be restarted.

      Still, we could not detect any sign of a reconciliation event taking place in the mariadb-operator. It looked as if the stop event was never delivered to the operator, which therefore could not restart the clusters.

      At this stage the galera pods regularly hit a liveness probe error because no galera server could be restarted, leading to recurring pod restarts.
      This is currently not picked up by the mariadb-operator, which only reacts to changes in the StatefulSet's availableReplicas.
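
      The gap described above can be illustrated with a minimal, self-contained sketch. It does not use the operator's actual code or any Kubernetes client library: the Snapshot type, both predicate functions, and the restart counts are hypothetical names and values invented for illustration. It only shows why a trigger keyed on availableReplicas stays silent while pods restart in a loop, and how a restart-count signal (one possible extra input, assumed here, not a confirmed fix) would differ.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Snapshot:
    """Hypothetical, simplified view of what a controller might observe."""
    available_replicas: int  # stands in for the StatefulSet's availableReplicas
    restart_counts: tuple    # illustrative per-pod container restart counts


def available_replicas_changed(old: Snapshot, new: Snapshot) -> bool:
    """Fire only when availableReplicas differs (the behavior reported above)."""
    return old.available_replicas != new.available_replicas


def restarts_increased(old: Snapshot, new: Snapshot) -> bool:
    """Hypothetical extra signal: some pod restarted since the last snapshot."""
    return any(n > o for o, n in zip(old.restart_counts, new.restart_counts))


# The galera cluster is down: no pod becomes available, but the liveness
# probe keeps failing, so the containers are restarted again and again.
observations = [
    Snapshot(available_replicas=0, restart_counts=(1, 1, 1)),
    Snapshot(available_replicas=0, restart_counts=(2, 2, 2)),
    Snapshot(available_replicas=0, restart_counts=(3, 3, 3)),
]

# Compare each snapshot with the one that follows it.
pairs = list(zip(observations, observations[1:]))
replica_triggers = [available_replicas_changed(o, n) for o, n in pairs]
restart_triggers = [restarts_increased(o, n) for o, n in pairs]

print(replica_triggers)  # [False, False]
print(restart_triggers)  # [True, True]
```

      In this toy model availableReplicas stays at 0 throughout the restart loop, so the first predicate never fires, which matches the absence of reconciliation events seen in the operator.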

      Bug impact

      Major service disruption: the database service goes into outage and does not recover automatically.

      Known workaround

      Restarting the mariadb-operator forces an initial reconciliation, so the clusters can be restarted.

       

              rhn-engineering-dciabrin Damien Ciabrini
              rhn-engineering-dciabrin Damien Ciabrini
              rhos-dfg-pidone