-
Bug
-
Resolution: Unresolved
-
Major
-
None
-
rhos-18.0.6
-
None
-
Bug Tracking
-
0
-
False
-
-
False
-
?
-
rhos-ops-platform-services-pidone
-
None
-
-
-
-
Sprint 11, Sprint 12
-
2
-
Important
Description
Witnessed on a customer environment running a post-FR2 version of the mariadb operator.
The pods of the two galera CRs were restarted after an environment issue, which meant the respective galera clusters stopped and needed to be restarted.
Still, we could not detect any sign of a reconciliation event taking place in the mariadb operator. It looked as if the stop event never reached the operator, which therefore could not restart the clusters.
At this stage the galera pods would regularly hit a liveness probe error because no galera server could be restarted, leading to recurring pod restarts.
This is currently not picked up by the mariadb-operator, which only reacts to changes in the statefulset's availableReplicas.
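To illustrate why the crash loop goes unnoticed, here is a minimal sketch in Python. It is not the operator's code: the `Snapshot` type, both predicate functions, and the restart counts are hypothetical, and watching pod restart counts is only one possible extra signal, assumed here for illustration. The sketch shows that when pods keep restarting without ever becoming available, availableReplicas stays unchanged between two observations, so a trigger based on that field alone never fires.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Snapshot:
    """Hypothetical view of one galera cluster as a controller might observe it."""
    available_replicas: int       # mirrors the statefulset's availableReplicas
    restart_counts: tuple = ()    # per-pod container restart counts (assumed signal)


def replicas_changed(old: Snapshot, new: Snapshot) -> bool:
    """Trigger only when availableReplicas differs between two observations."""
    return old.available_replicas != new.available_replicas


def replicas_or_restarts_changed(old: Snapshot, new: Snapshot) -> bool:
    """Hypothetical wider trigger: also fire when any pod restart count moved."""
    return replicas_changed(old, new) or old.restart_counts != new.restart_counts


# Crash loop: every pod restarts again, but none ever becomes available.
before = Snapshot(available_replicas=0, restart_counts=(3, 3, 3))
after = Snapshot(available_replicas=0, restart_counts=(4, 4, 4))

print(replicas_changed(before, after))              # False: no reconcile is triggered
print(replicas_or_restarts_changed(before, after))  # True: the restarts are noticed
```

In this model the first check stays silent for as long as the pods keep failing, which matches the absence of reconciliation events described above.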
Bug impact
Major service disruption: the database service goes into an outage that is not resolved automatically.
Known workaround
Restarting the mariadb-operator forces an initial reconciliation, so the cluster can be restarted.