Bug
Resolution: Not a Bug
Critical
rhos-ops-platform-services-pidone
The customer reported intermittent issues with the Horizon dashboard, which displayed "Something went wrong!" error messages between 05:25 PM and 06:40 PM IST on specific dates. During this period, users were unable to view resources, and VM creation jobs failed because several backend OpenStack control-plane pods, including Galera (MariaDB), Cinder Scheduler, Cinder Volume, and Cinder Backup, were restarting automatically.
1- The Cinder and Galera pods keep restarting, preventing the customer from creating new instances.
2- Readiness and liveness probe connection timeouts are causing constant restarts; a manual check of the probe condition is sketched after the events below.
Events:
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning Unhealthy 42m (x18 over 23d) kubelet Readiness probe failed: command timed out
Warning Unhealthy 42m (x18 over 23d) kubelet Liveness probe failed: command timed out
Normal Started 41m (x16 over 21h) kubelet Started container galera
Normal Pulled 41m (x17 over 21h) kubelet Container image "registry.redhat.io/rhoso/openstack-mariadb-rhel9@sha256:2dd44ddf73d775c9b60421f14e4808bdda377cc57b864bf2d9a1bebd63fd6b41" already present on machine
Normal Created 41m (x17 over 21h) kubelet Created container: galera
Normal Killing 41m (x5 over 20h) kubelet Container galera failed startup probe, will be restarted
Warning FailedPreStopHook 41m (x2 over 19h) kubelet PreStopHook failed
Warning Unhealthy 41m kubelet Startup probe failed: ERROR 2002 (HY000): Can't connect to local MySQL server through socket '/var/lib/mysql/mysql.sock' (111)
Warning Unhealthy 34m kubelet Readiness probe failed: wsrep_local_state_comment (Inconsistent) differs from Synced
Warning Unhealthy 31m (x7 over 19h) kubelet Readiness probe failed: wsrep_local_state_comment (Initialized) differs from Synced
Warning BackOff 28m (x44 over 20h) kubelet Back-off restarting failed container galera in pod openstack-galera-0_openstack(65196497-8cc3-4245-8147-f58b6392eda1)
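For reference, the condition the readiness probe reports on can be checked manually from inside the pod. This is a minimal sketch, assuming the pod and namespace from the events above and that the root credentials are exposed in the container environment (the variable name MYSQL_ROOT_PASSWORD is an assumption; the actual credential comes from the corresponding Secret):
# Compare the wsrep state against "Synced", as the readiness probe does
$ oc exec -n openstack openstack-galera-0 -c galera -- \
    mysql -uroot -p"$MYSQL_ROOT_PASSWORD" \
    -e "SHOW GLOBAL STATUS LIKE 'wsrep_local_state_comment';"
# Check whether the node is part of the primary component and the cluster size
$ oc exec -n openstack openstack-galera-0 -c galera -- \
    mysql -uroot -p"$MYSQL_ROOT_PASSWORD" \
    -e "SHOW GLOBAL STATUS LIKE 'wsrep_cluster%';"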
3- The pods went into a CrashLoopBackOff state.
4- From the latest describe output, the container state is Running again, but the pod keeps restarting and has a high restart count (a quick way to pull the restart counts is sketched after the snippet below):
/usr/local/bin/kolla_start
State: Running
Started: Wed, 03 Dec 2025 17:20:59 +0530
Last State: Terminated
Reason: Error
Exit Code: 134
Started: Wed, 03 Dec 2025 17:10:48 +0530
Finished: Wed, 03 Dec 2025 17:18:36 +0530
Ready: True
Restart Count: 36
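A minimal sketch for pulling the restart counts and the previous crash logs in one pass, assuming the openstack namespace and the galera container name seen in the events (the custom-columns selection only reads the first container status of each pod):
# Restart counts for all pods in the namespace
$ oc get pods -n openstack -o custom-columns=NAME:.metadata.name,RESTARTS:.status.containerStatuses[0].restartCount
# Logs from the previous (crashed) instance of the galera container
$ oc logs openstack-galera-0 -n openstack -c galera --previous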
5- The Galera pod restarted and attempted an SST (State Snapshot Transfer). Because of the InnoDB corruption [1], the SST failed, so the node cannot rejoin the Galera cluster and the liveness probe fails again.
This is the loop we observed. (A sketch of how to assess the corruption and attempt a logical dump follows the log excerpt below.)
[1] 2025-12-03 11:48:34 80 [ERROR] InnoDB: Database page corruption on disk or a failed read of file './neutron/networksegments.ibd' page [page id: space=447, page number=0]. You may have to recover from a backup.
2025-12-03 11:48:34 80 [Note] InnoDB: Page dump (16384 bytes):
2025-12-03 11:48:34 80 [Note] InnoDB: 1703030013d86c6ebedfef30aa4ef2ea2caa86a81bdabcbd0008000000000000
...
2025-12-03 11:48:34 80 [Note] InnoDB: 0000000000000000000000000000000000000000000000000057803933ccf488
2025-12-03 11:48:34 80 [Note] InnoDB: End of page dump
2025-12-03 11:48:34 80 [Note] InnoDB: You can use CHECK TABLE to scan your table for corruption. Please refer to https://mariadb.com/kb/en/library/innodb-recovery-modes/ for information about forcing recovery.
2025-12-03 11:48:34 80 [ERROR] [FATAL] InnoDB: Unable to read page [page id: space=447, page number=0] into the buffer pool after 100. The most probable cause of this error may be that the table has been corrupted. See https://mariadb.com/kb/en/library/innodb-recovery-modes/
251203 11:48:34 [ERROR] mysqld got signal 6 ;
This could be because you hit a bug. It is also possible that this binary
or one of the libraries it was linked against is corrupt, improperly built,
or misconfigured. This error can also be caused by malfunctioning hardware.
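Following the hints in the log, the damage can be assessed with CHECK TABLE, and if mysqld will not stay up, InnoDB can be started in forced-recovery mode to attempt a logical dump before the member is rebuilt. This is a hedged sketch, not a fix: the exec path and credentials are the same assumptions as above, innodb_force_recovery should be raised one step at a time, and values of 4 and above block data-changing statements.
# Scan the table named in the error for corruption
$ oc exec -n openstack openstack-galera-0 -c galera -- \
    mysql -uroot -p"$MYSQL_ROOT_PASSWORD" -e "CHECK TABLE neutron.networksegments;"
# If mysqld keeps aborting, add a forced-recovery setting to the server config,
# restart, and take a logical dump of whatever is still readable:
#   [mysqld]
#   innodb_force_recovery = 1
$ mysqldump -uroot -p"$MYSQL_ROOT_PASSWORD" --all-databases > /tmp/all-databases.sql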
Unfortunately, the customer does not have any MariaDB backup.
We need help resolving the issue and identifying the root cause of the corruption (one thing worth ruling out first is I/O errors on the backing storage, sketched below).
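Since the mysqld backtrace lists malfunctioning hardware as a possible cause, it may be worth checking the node and the PVC backing this Galera member for I/O errors before rebuilding it. A minimal sketch, with <node-name> as a placeholder for whichever node hosts openstack-galera-0:
# Find the node and PVC behind the corrupted member
$ oc get pod openstack-galera-0 -n openstack -o wide
$ oc get pvc -n openstack | grep galera
# Look for block-layer or filesystem I/O errors on that node
$ oc debug node/<node-name> -- chroot /host dmesg -T | grep -iE 'i/o error|xfs|ext4'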