  1. Red Hat OpenStack Services on OpenShift
  2. OSPRH-11488

failover of the galera service exposes disconnected galera node to the openstack clients


    • Bug
    • Resolution: Done
    • Major
    • rhos-18.0.4
    • rhos-18.0.0
    • mariadb-operator
    • None
    • 7
    • False
    • False
    • ?
    • ?
    • mariadb-operator-container-1.0.6-5
    • ?
    • ?
    • None
    • PIDONE 18.0.4, PIDONE 18.0.5
    • 2
    • Important

      The Galera cluster managed by the mariadb-operator currently exposes a single galera node through the service, out of the 3 available galera nodes (A/P configuration).
      When the currently active galera node is stopping (e.g. when a minor update triggers a rolling restart of the galera cluster), it runs a script that calls the k8s API server to remove the galera node from the service selector (making the node inactive). The other galera nodes get notified that the node is stopping, and another available node is chosen to become the new active galera node.
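
The selector switch described above can be sketched as follows. This is only an illustration of the idea, not the operator's actual script: the selector label key, the pod names and the helper functions are assumptions, and the resulting body would still have to be sent to the k8s API server as a merge patch on the Service.

```python
import json

# Assumption: the Service selects the active pod through this label key.
# The real operator may use a different key.
ACTIVE_POD_LABEL = "statefulset.kubernetes.io/pod-name"


def choose_new_active(node_states, stopping):
    """Pick the first node that is Synced and is not the one stopping.

    node_states maps a (hypothetical) pod name to its Galera local state.
    Returns None when no other node can take over.
    """
    for name, state in node_states.items():
        if name != stopping and state == "Synced":
            return name
    return None


def selector_patch(new_active):
    """Build a JSON merge-patch body that points the Service at one pod.

    With merge-patch semantics, a null value removes the key, which
    leaves the Service without any active galera node.
    """
    return {"spec": {"selector": {ACTIVE_POD_LABEL: new_active}}}


if __name__ == "__main__":
    states = {"galera-0": "Synced", "galera-1": "Synced", "galera-2": "Donor"}
    target = choose_new_active(states, stopping="galera-0")
    print(json.dumps(selector_patch(target)))
```

The point of the sketch is the ordering: the patch is only useful if it reaches the API server before clients can observe the stopping node in a degraded state.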

      There is currently an issue in the failover process when the stopping node removes itself from the service selector. At the time the script is called, the galera node has already started disconnecting from the galera cluster, so galera replication is stopped; however, the mysql server itself hasn't closed its SQL socket yet, which means openstack clients are still able to communicate with the server.
      This time window is sufficiently large that openstack clients can sometimes see the database service as being not fully functional, which manifests itself as the following oslo.db error in the logs:

      [Tue Nov 12 15:13:29.451071 2024] [wsgi:error] [pid 24:tid 56] [remote 192.168.56.69:34548] 2024-11-12 15:13:29.449 24 ERROR keystone.server.flask.request_processing.middleware.auth_context [None req-660b7567-04f6-4109-b00e-d94ccf952afd - - - - - -] (pymysql.err.OperationalError) (1047, 'WSREP has not yet prepared node for application use')
      [Tue Nov 12 15:13:29.451106 2024] [wsgi:error] [pid 24:tid 56] [remote 192.168.56.69:34548] (Background on this error at: https://sqlalche.me/e/14/e3q8): oslo_db.exception.DBConnectionError: (pymysql.err.OperationalError) (1047, 'WSREP has not yet prepared node for application use')
      [Tue Nov 12 15:13:29.451177 2024] [wsgi:error] [pid 24:tid 56] [remote 192.168.56.69:34548] 2024-11-12 15:13:29.449 24 ERROR keystone.server.flask.request_processing.middleware.auth_context pymysql.err.OperationalError: (1047, 'WSREP has not yet prepared node for application use')

      Only a moment later, once the script has successfully reconfigured the K8s service and the galera node is stopped, does the client's socket see the disconnection:

                                                       
      cd9da060-3798-4e60-873d-508d87c02e81 - - - - - -] Database connection was found disconnected; reconnecting: oslo_db.exception.DBConnectionError: (pymysql.err.OperationalError) (2013, 'Lost connection to MySQL server during query')
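
To make the problematic window concrete, here is a minimal sketch of a check that separates "SQL socket still open" from "node safe to expose to clients". It assumes the caller has already read the wsrep status variables (for example with SHOW GLOBAL STATUS LIKE 'wsrep_%') into a dictionary; the function name is hypothetical and this is not what the operator runs.

```python
def is_serviceable(status):
    """Return True only if the node should receive client traffic.

    status maps Galera status variable names to their string values.
    A node that has begun leaving the cluster still accepts SQL
    connections, but it no longer reports wsrep_ready=ON.
    """
    return (
        status.get("wsrep_ready") == "ON"
        and status.get("wsrep_cluster_status") == "Primary"
        and status.get("wsrep_local_state_comment") == "Synced"
    )


if __name__ == "__main__":
    healthy = {
        "wsrep_ready": "ON",
        "wsrep_cluster_status": "Primary",
        "wsrep_local_state_comment": "Synced",
    }
    # Assumed values for a node that has started its shutdown.
    stopping = {"wsrep_ready": "OFF", "wsrep_cluster_status": "Disconnected"}
    print(is_serviceable(healthy), is_serviceable(stopping))
```

During the window described in this issue, the stopping node would fail such a check while still being the target of the service selector.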

      The unexpected WSREP status is handled neither by oslo.db nor by the openstack clients, so the ongoing SQL actions are not retried properly. From an end user's perspective, this shows up as an openstack API error.
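
For illustration only, the kind of handling that is missing could look like the sketch below. It is not oslo.db code: OperationalError is a local stand-in for pymysql.err.OperationalError (whose args are a code and a message), the helper names are hypothetical, and it assumes the unit of work is a whole operation that is safe to replay from the start.

```python
import time

WSREP_NOT_READY = 1047  # code seen in the keystone log above
LOST_CONNECTION = 2013  # code seen once the SQL socket is closed


class OperationalError(Exception):
    """Stand-in for pymysql.err.OperationalError; args are (code, message)."""


def is_retryable(exc, codes=(WSREP_NOT_READY, LOST_CONNECTION)):
    """Treat both the WSREP error and the plain disconnection as transient."""
    return bool(exc.args) and exc.args[0] in codes


def call_with_retry(unit_of_work, attempts=3, delay=0.5, sleep=time.sleep):
    """Run unit_of_work, replaying it when a transient DB error is raised.

    The last failure, or any non-transient error, is re-raised unchanged.
    """
    for attempt in range(1, attempts + 1):
        try:
            return unit_of_work()
        except OperationalError as exc:
            if not is_retryable(exc) or attempt == attempts:
                raise
            sleep(delay)
```

A caller would wrap a complete request-level operation, e.g. `call_with_retry(lambda: do_query())` with a hypothetical `do_query`; replaying only part of a transaction would not be safe.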

       

      The OpenStack codebase that handles DB reconnections didn't change between OSP17 and 18. However, this WSREP status is exposed differently in 18, so minor updates now cause temporary openstack API disruption, which is not expected and should be avoided.

              rhn-engineering-dciabrin Damien Ciabrini
              rhn-engineering-dciabrin Damien Ciabrini
              rhos-dfg-pidone