- Epic
- Resolution: Done
- Normal
- None
- rhos-18.0.5
- SNR/NHC: Managing PVCs on worker failures with Galera & RabbitMQ
- False
- False
- Not Selected
- Proposed
- Proposed
- Done
- Proposed
- rhos-ops-platform-services-pidone
- Proposed
- 0% To Do, 0% In Progress, 100% Done
- Moderate
Goal:
The goal of this Epic is to investigate and define how Self Node Remediation (SNR) and Node Health Check (NHC) should handle non-graceful worker node shutdowns in an environment running Galera & RabbitMQ operators.
Specifically, we aim to determine whether SNR should automatically clean up Persistent Volume Claims (PVCs) when a node fails, and how this interacts with Logical Volume Manager Storage (LVMS), whose volumes are local to the node that provisioned them.
Resolving this should improve pod rescheduling behavior, prevent storage inconsistencies, and make the cluster more resilient to node failures.
Acceptance Criteria:
- Define the current behavior of SNR/NHC when a worker node with LVMS storage fails non-gracefully.
- Assess whether automatic PVC deletion should be handled within SNR, NHC, or another component.
- Identify potential risks of deleting PVCs (e.g., volume attachments, kube-controller-manager (KCM) detach issues).
- Evaluate whether NHC escalating remediations could be leveraged for delayed or conditional cleanup.
- Determine if we should provide an optional user-configurable setting for PVC deletion.
- Ensure that any new behavior aligns with Galera & RabbitMQ operator expectations regarding storage recovery and pod rescheduling.
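The criteria about an optional, user-configurable setting and delayed or conditional cleanup can be made concrete with a small sketch. This is an illustration only, not existing SNR or NHC behavior: `CleanupPolicy`, `should_delete_pvc`, and the 600-second default are hypothetical names and values, and `topolvm.io` is assumed to be the provisioner name used by LVMS.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CleanupPolicy:
    """Hypothetical opt-in policy for PVC cleanup after a node failure."""
    enabled: bool = False  # cleanup stays off unless the user opts in
    # Assumption: provisioners whose volumes cannot follow a pod to another node.
    node_local_provisioners: frozenset[str] = frozenset({"topolvm.io"})
    # Assumption: how long the node must stay unhealthy before cleanup is allowed.
    min_unhealthy_seconds: int = 600


def should_delete_pvc(policy: CleanupPolicy, provisioner: str,
                      node_fenced: bool, unhealthy_seconds: int) -> bool:
    """Return True only when every safety condition for deletion holds."""
    return (
        policy.enabled
        and node_fenced  # never delete while the node might still write to the volume
        and provisioner in policy.node_local_provisioners
        and unhealthy_seconds >= policy.min_unhealthy_seconds
    )
```

Keeping the policy disabled by default and requiring that the node is already fenced reflects the risk noted above: deleting a claim on node-local storage discards the data on that volume, so the operators must be able to rebuild the member from its peers.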
Open Questions:
- How does SNR currently handle volume attachments, and does this interfere with KCM operations?
- Should NHC be responsible for triggering storage cleanup, or should it be offloaded to an external process?
- Are NHC escalating remediations a viable alternative for handling cleanup, and how should they be implemented?
- Is it expected that the out-of-service taint (node.kubernetes.io/out-of-service) does not trigger the detachment or deletion of PVCs?
- Is it expected that the SNR ResourceDeletion remediation strategy does not handle PVC detachment or deletion?
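To reason about these questions it helps to know which volumes are actually stranded by a node failure. The sketch below is a minimal example that works on PersistentVolume manifests already loaded as dictionaries (for instance from `oc get pv -o json`) and lists the PVs whose required node affinity pins them to the failed node. The function names are hypothetical, and the topology keys are an assumption: `kubernetes.io/hostname` is the standard hostname label, while `topology.topolvm.io/node` is assumed to be the key LVMS sets.

```python
from typing import Any


def _term_pins_to(term: dict[str, Any], node_name: str,
                  topology_keys: tuple[str, ...]) -> bool:
    """Expressions in a term are ANDed, so one 'In [node]' match pins the term."""
    for expr in term.get("matchExpressions", []):
        if (expr.get("key") in topology_keys
                and expr.get("operator") == "In"
                and expr.get("values") == [node_name]):
            return True
    return False


def pvs_pinned_to_node(
    pvs: list[dict[str, Any]],
    node_name: str,
    topology_keys: tuple[str, ...] = ("kubernetes.io/hostname",
                                      "topology.topolvm.io/node"),
) -> list[tuple[str, str, str]]:
    """Return (pv_name, claim_namespace, claim_name) for PVs usable only on node_name."""
    pinned = []
    for pv in pvs:
        spec = pv.get("spec", {})
        required = spec.get("nodeAffinity", {}).get("required", {})
        terms = required.get("nodeSelectorTerms", [])
        # Terms are ORed: the PV is stranded only if every term points at the node.
        if terms and all(_term_pins_to(t, node_name, topology_keys) for t in terms):
            claim = spec.get("claimRef") or {}
            pinned.append((pv["metadata"]["name"],
                           claim.get("namespace", ""),
                           claim.get("name", "")))
    return pinned
```

The claims returned here are the ones that would keep a rescheduled Galera or RabbitMQ pod in Pending, since the scheduler cannot place the pod on any node other than the failed one while the claim stays bound to that PV.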
- relates to
- OSPRH-20578 To study the gap to allow deployment of galera on non local disks
- Backlog