Red Hat OpenStack Services on OpenShift / OSPRH-22814

Cinder and galera pods keep restarting


    • Type: Bug
    • Resolution: Not a Bug
    • Priority: Critical
    • Component: mariadb-operator
    • Severity: Critical

      The customer reported intermittent issues with the Horizon dashboard displaying "Something went wrong!" error messages between 05:25 PM and 06:40 PM IST on specific dates. During this period, users were unable to view resources, and VM creation jobs failed due to the automatic restart of several backend OpenStack control-plane pods, including Galera (MariaDB), Cinder Scheduler, Cinder Volume, and Cinder Backup.

      1- Cinder and Galera pods keep restarting, affecting the customer's ability to create new instances.

      2- Readiness and liveness probe connection timeouts are causing the constant restarts (a manual check of the state the probes inspect is sketched after the events below):

      Events:
      Type Reason Age From Message
      ---- ------ ---- ---- -------
      Warning Unhealthy 42m (x18 over 23d) kubelet Readiness probe failed: command timed out
      Warning Unhealthy 42m (x18 over 23d) kubelet Liveness probe failed: command timed out
      Normal Started 41m (x16 over 21h) kubelet Started container galera
      Normal Pulled 41m (x17 over 21h) kubelet Container image "registry.redhat.io/rhoso/openstack-mariadb-rhel9@sha256:2dd44ddf73d775c9b60421f14e4808bdda377cc57b864bf2d9a1bebd63fd6b41" already present on machine
      Normal Created 41m (x17 over 21h) kubelet Created container: galera
      Normal Killing 41m (x5 over 20h) kubelet Container galera failed startup probe, will be restarted
      Warning FailedPreStopHook 41m (x2 over 19h) kubelet PreStopHook failed
      Warning Unhealthy 41m kubelet Startup probe failed: ERROR 2002 (HY000): Can't connect to local MySQL server through socket '/var/lib/mysql/mysql.sock' (111)
      Warning Unhealthy 34m kubelet Readiness probe failed: wsrep_local_state_comment (Inconsistent) differs from Synced
      Warning Unhealthy 31m (x7 over 19h) kubelet Readiness probe failed: wsrep_local_state_comment (Initialized) differs from Synced
      Warning BackOff 28m (x44 over 20h) kubelet Back-off restarting failed container galera in pod openstack-galera-0_openstack(65196497-8cc3-4245-8147-f58b6392eda1)
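
      For reference, the probe failures above compare wsrep_local_state_comment against "Synced". A minimal manual check of that same status, assuming the pod name and namespace shown in the events and that a root password is exposed to the container (a placeholder here, since the exact credential handling of the probe script is not confirmed):

      # Open a shell in the galera container (pod/namespace taken from the events above)
      oc -n openstack rsh -c galera openstack-galera-0

      # Inside the container, query the wsrep state the probes compare to "Synced";
      # DB_ROOT_PASSWORD is a placeholder assumption for however root credentials are
      # provided in this environment.
      mysql -uroot -p"$DB_ROOT_PASSWORD" -e "SHOW GLOBAL STATUS LIKE 'wsrep_local_state_comment';"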

      3- The pods went into a CrashLoopBackOff state.

      4- From the latest describe output the container state is Running again, but it keeps restarting and has a high restart count:

      /usr/local/bin/kolla_start
      State:          Running
        Started:      Wed, 03 Dec 2025 17:20:59 +0530
      Last State:     Terminated
        Reason:       Error
        Exit Code:    134
        Started:      Wed, 03 Dec 2025 17:10:48 +0530
        Finished:     Wed, 03 Dec 2025 17:18:36 +0530
      Ready:          True
      Restart Count:  36
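
      Exit code 134 corresponds to SIGABRT (128 + 6), which matches the "mysqld got signal 6" abort quoted in [1] below. The log of the previous (crashed) container instance can be pulled for the same pod (namespace assumed to be openstack, as in the events above):

      # Logs from the previous container run, i.e. the one that exited with code 134
      oc -n openstack logs openstack-galera-0 -c galera --previous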

      5 - The Galera pod restarted and attempted an SST (State Snapshot Transfer); due to InnoDB corruption [1], the SST failed, so the node cannot rejoin the Galera cluster and the liveness probe failure is triggered again.

      This is the loop we observed.
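
      As a hedged first step to see which replica still has a usable local state, each node's Galera state file can be compared; the three pod names and the /var/lib/mysql datadir path are assumptions based on the default layout, and the corruption itself is quoted at [1] below:

      # Compare uuid/seqno/safe_to_bootstrap recorded on each node; the node with the
      # highest seqno (and uncorrupted data files) is the usual candidate to keep.
      for i in 0 1 2; do
          echo "=== openstack-galera-$i ==="
          oc -n openstack exec "openstack-galera-$i" -c galera -- cat /var/lib/mysql/grastate.dat
      done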

      [1] 2025-12-03 11:48:34 80 [ERROR] InnoDB: Database page corruption on disk or a failed read of file './neutron/networksegments.ibd' page [page id: space=447, page number=0]. You may have to recover from a backup.
      2025-12-03 11:48:34 80 [Note] InnoDB: Page dump (16384 bytes):
      2025-12-03 11:48:34 80 [Note] InnoDB: 1703030013d86c6ebedfef30aa4ef2ea2caa86a81bdabcbd0008000000000000
      ...
      2025-12-03 11:48:34 80 [Note] InnoDB: 0000000000000000000000000000000000000000000000000057803933ccf488
      2025-12-03 11:48:34 80 [Note] InnoDB: End of page dump
      2025-12-03 11:48:34 80 [Note] InnoDB: You can use CHECK TABLE to scan your table for corruption. Please refer to https://mariadb.com/kb/en/library/innodb-recovery-modes/ for information about forcing recovery.
      2025-12-03 11:48:34 80 [ERROR] [FATAL] InnoDB: Unable to read page [page id: space=447, page number=0] into the buffer pool after 100. The most probable cause of this error may be that the table has been corrupted. See https://mariadb.com/kb/en/library/innodb-recovery-modes/
      251203 11:48:34 [ERROR] mysqld got signal 6 ;
      This could be because you hit a bug. It is also possible that this binary
      or one of the libraries it was linked against is corrupt, improperly built,
      or misconfigured. This error can also be caused by malfunctioning hardware.
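
      Since the error names ./neutron/networksegments.ibd and itself suggests CHECK TABLE, a first verification step on a node whose mysqld still starts (or one started with a low innodb_force_recovery level, per the linked MariaDB recovery-modes page) could be the following; the password handling is again a placeholder assumption:

      # Table name taken from the InnoDB corruption message above
      mysql -uroot -p"$DB_ROOT_PASSWORD" -e "CHECK TABLE neutron.networksegments;"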

      Unfortunately, the customer does not have any MariaDB backup.
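
      Given that no backup exists, one cautious option before any further recovery or SST attempts would be a logical dump from the most consistent surviving node; the pod name (openstack-galera-1) and the DB_ROOT_PASSWORD variable below are placeholder assumptions:

      # Stream a full logical dump from a surviving replica to the local workstation
      oc -n openstack exec openstack-galera-1 -c galera -- \
          sh -c 'mysqldump -uroot -p"$DB_ROOT_PASSWORD" --all-databases --single-transaction' > all-databases.sql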

      We need help resolving the issue and identifying the root cause of the corruption.

              rhn-engineering-dciabrin Damien Ciabrini
              rh-ee-anbs Anjana B S
              Luca Miccini
              rhos-dfg-pidone