-
Task
-
Resolution: Done
-
Undefined
-
rhos-18.0.0
What is the probability and severity of the issue? I.e. the overall risk
Medium probability, high impact.
When a Galera pod is temporarily disconnected from the Galera cluster
(and drops to a non-primary partition), the pod's health probe never detects the condition,
so OpenShift can never restart the pod automatically.
This can lead to a service outage that is difficult for the customer to troubleshoot
and that can only be fixed by manual intervention.
The fixed health probe is already implemented and merged upstream in main.
Shipping that fix will resolve both OSPRH-8862 and OSPRH-5705.
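For illustration only, the kind of check such a probe needs can be sketched as below. This is not the merged upstream probe, whose code is not shown in this ticket. The sketch assumes the probe can obtain the tab-separated output of SHOW STATUS LIKE 'wsrep_%' from the local MariaDB server, and the function names parse_wsrep_status and is_in_primary_partition are hypothetical. It relies on the standard Galera status variable wsrep_cluster_status, which reports "Primary" only when the node belongs to the primary partition.

```python
def parse_wsrep_status(raw):
    """Parse tab- or space-separated `SHOW STATUS LIKE 'wsrep_%'` output.

    Hypothetical helper: returns a dict mapping variable name to value.
    Lines without both a name and a value are ignored.
    """
    status = {}
    for line in raw.splitlines():
        parts = line.strip().split(None, 1)
        if len(parts) == 2:
            status[parts[0]] = parts[1].strip()
    return status


def is_in_primary_partition(status):
    """Return True only if the node reports it is in the primary partition.

    A missing variable is treated as unhealthy, so a node that cannot
    report its cluster status fails the check instead of passing silently.
    """
    return status.get("wsrep_cluster_status") == "Primary"


if __name__ == "__main__":
    # Sample output, assumed for this sketch, of a node that lost quorum.
    sample = "wsrep_cluster_status\tnon-Primary\nwsrep_cluster_size\t1\n"
    healthy = is_in_primary_partition(parse_wsrep_status(sample))
    # A probe script would typically exit non-zero when the check fails.
    print("healthy" if healthy else "unhealthy")
```

A probe built on a check like this would fail while the node sits in a non-primary partition, which gives OpenShift a signal to restart the pod.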
Does this affect specific configurations, hardware, environmental factors, etc.?
All deployments are impacted.
Are any partners relying on this functionality in order to ship an ecosystem product?
No.
What proportion of our customers could hit this issue?
All of them could hit it. Any network disruption in the customer's environment could
trigger it.
Does this happen for only a specific use case?
No, all OpenStack control plane workloads can be impacted.
What proportion of our CI infrastructure, automation, and test cases does this issue impact?
No impact upstream, as the fix was merged post 18.0.1.
Is this a regression in supported functionality from a previous release?
Yes, as the automatic restart provided by HA does not work correctly.
Is there a clear workaround?
The workaround consists of force-restarting the Galera pods manually, which requires
customer knowledge and intervention.
Is there potential doc impact?
No doc impact, because this is nothing more than a bug that should be addressed.
If this is a UI issue:
Is the UI still fit for its purpose/goal?
N/A
Does the bug compromise the overall trustworthiness of the UI?
N/A
Overall context and effort – is the overall impact bigger/worse than the bug in isolation? For example, 1 workaround might seem ok, 5 is getting ugly, 20 might be unacceptable (rough numbers).
1