Type: Task
Resolution: Done
Priority: Undefined
Fix Version/s: rhos-18.0.2
Affects Version/s: rhos-18.0.0
Component/s: mariadb-operator
Labels:
- PIDONE
- rhos-trac

Blocked:
False
Blocked Reason:

None
Ready:
False
Dev Approval:
?
Docs Approval:
?
PM Approval:
?
QE Approval:
?
Intelligence Requested:
Market:
Target Version:

rhos-18.0.2

Severity:
Important

SFDC Cases Counter:
SFDC Cases Open:
SFDC Cases Links:

What is the probability and severity of the issue? I.e. the overall risk
Medium probability, high impact.
When Galera pods experience a temporary disconnection from the Galera cluster
(moving to a non-primary partition), the condition is never detected by the pod's health probe,
and consequently the pods are never restarted automatically by OpenShift.
This can lead to a service outage that is difficult for customers to troubleshoot
and can only be fixed by manual intervention.
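
To illustrate the kind of check the probe is missing, here is a minimal sketch in Python. It is not the actual mariadb-operator probe; the function names are hypothetical, and it assumes the probe has already collected the output of `SHOW STATUS LIKE 'wsrep_%'` as (name, value) rows. The status variables `wsrep_cluster_status`, `wsrep_ready` and `wsrep_local_state_comment` are standard Galera status variables.

```python
# Hypothetical sketch of a partition-aware health check.
# Not the upstream mariadb-operator implementation.

def parse_wsrep_status(rows):
    """Turn (Variable_name, Value) rows from SHOW STATUS into a dict."""
    return {name: value for name, value in rows}


def galera_liveness_ok(status):
    """Liveness: fail when the node is outside the primary partition.

    A failing liveness probe is what lets the platform restart the pod.
    """
    return status.get("wsrep_cluster_status", "").lower() == "primary"


def galera_readiness_ok(status):
    """Readiness: the node is in the primary partition, ready and synced."""
    return (
        galera_liveness_ok(status)
        and status.get("wsrep_ready", "").upper() == "ON"
        and status.get("wsrep_local_state_comment") == "Synced"
    )


if __name__ == "__main__":
    # Example rows for a node that lost quorum (values are illustrative).
    partitioned = parse_wsrep_status([
        ("wsrep_cluster_status", "non-Primary"),
        ("wsrep_ready", "OFF"),
        ("wsrep_local_state_comment", "Initialized"),
    ])
    print("liveness ok:", galera_liveness_ok(partitioned))
```

With a check of this shape, a node stuck in a non-primary partition fails its probe and becomes eligible for an automatic restart instead of staying up but unusable.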

The fixed health probe is already implemented and merged upstream in main.

Getting that fixed will resolve both OSPRH-8862 and OSPRH-5705.

Does this affect specific configurations, hardware, environmental factors, etc.?
All deployments are impacted.

Are any partners relying on this functionality in order to ship an ecosystem product?
No.

What proportion of our customers could hit this issue?
All of them could hit it. Any network disruption in the customer's environment could
trigger it.

Does this happen for only a specific use case?
No, all OpenStack control plane workloads can be impacted.

What proportion of our CI infrastructure, automation, and test cases does this issue impact?
No impact upstream, as the fix was merged after 18.0.1.

Is this a regression in supported functionality from a previous release?
Yes, as the automatic restart provided by HA does not work correctly.

Is there a clear workaround?
The workaround consists of force-restarting the Galera pods manually, which requires
customer knowledge and intervention.

Is there potential doc impact?
No doc impact, because this is nothing more than a bug that should be addressed.

If this is a UI issue:
Is the UI still fit for its purpose/goal?
N/A

Does the bug compromise the overall trustworthiness of the UI?
N/A

Overall context and effort – is the overall impact bigger/worse than the bug in isolation? For example, 1 workaround might seem ok, 5 is getting ugly, 20 might be unacceptable (rough numbers).
1

Assignee: Unassigned

Reporter: Damien Ciabrini

Team: rhos-dfg-pidone

Votes: 0

Watchers: 1

Created: 2024/09/24 8:52 AM

Updated: 2024/09/30 8:04 PM

Resolved: 2024/09/30 8:04 PM
