Feature
Resolution: Unresolved
Normal
rhos-18.0.17 FR 5, rhos-18.0.14 FR 4
rhos-ops-platform-services-pidone
Goal
Provide a controller (PodRemediator) in the infra-operator that integrates with Node Health Check (NHC) and Self Node Remediation (SNR) to remediate pods with local PVCs when worker nodes are marked unhealthy. This implements the direction investigated in OSPRH-14880 and improves cluster resilience for stateful workloads (e.g. Galera, RabbitMQ) on LVMS or local storage.
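The selection step implied by this goal can be illustrated with a minimal Go sketch. It is not the infra-operator implementation: the `Node` and `Pod` types, the `Healthy` flag, and the `podsToRemediate` function are hypothetical simplifications made up for illustration, and the real controller would work against Kubernetes and NHC/SNR API objects rather than plain structs.

```go
package main

import "fmt"

// Node is a hypothetical, simplified view of a worker node.
// Healthy stands in for whatever signal marks the node unhealthy.
type Node struct {
	Name    string
	Healthy bool
}

// Pod is a hypothetical, simplified view of a pod.
// LocalPVCs lists claims assumed to be bound to node-local storage.
type Pod struct {
	Name      string
	NodeName  string
	LocalPVCs []string
}

// podsToRemediate returns the pods that run on an unhealthy node
// and hold at least one local PVC.
func podsToRemediate(nodes []Node, pods []Pod) []Pod {
	unhealthy := make(map[string]bool)
	for _, n := range nodes {
		if !n.Healthy {
			unhealthy[n.Name] = true
		}
	}
	var out []Pod
	for _, p := range pods {
		if unhealthy[p.NodeName] && len(p.LocalPVCs) > 0 {
			out = append(out, p)
		}
	}
	return out
}

func main() {
	nodes := []Node{
		{Name: "worker-0", Healthy: false},
		{Name: "worker-1", Healthy: true},
	}
	pods := []Pod{
		{Name: "galera-0", NodeName: "worker-0", LocalPVCs: []string{"mysql-db-galera-0"}},
		{Name: "stateless-0", NodeName: "worker-0"},
		{Name: "rabbitmq-0", NodeName: "worker-1", LocalPVCs: []string{"persistence-rabbitmq-0"}},
	}
	for _, p := range podsToRemediate(nodes, pods) {
		fmt.Println(p.Name)
	}
}
```

In this sketch only pods that are both on an unhealthy node and attached to local storage are selected; pods without local PVCs are left to the normal rescheduling path.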
Acceptance Criteria
- PodRemediator CRD, RBAC (nodes, PVCs, NHC/SNR APIs), and controller logic are implemented in infra-operator; PodRemediator CR is available and the controller runs in the openstack-operators namespace.
- Controller reaches Ready=True when NHC and SNR are installed and the supplemental RBAC is applied.
- The POC can be reproduced from scratch: deploy the custom infra-operator image, apply the CRD and RBAC, install NHC and SNR, and create a PodRemediator CR. An optional local-PVC test can then be applied and verified on the POC cluster.
- Controller behavior is documented and aligned with operator expectations (e.g. Galera, RabbitMQ) regarding storage recovery and pod rescheduling; this Epic requires no changes inside mariadb-operator or rabbitmq-cluster-operator.
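The Ready=True criterion above can be read as a conjunction of three prerequisites. The following Go sketch shows one way such a check might be expressed. The `Prerequisites` and `Condition` types, the `readyCondition` function, and the reason strings are all hypothetical names chosen for illustration, not the infra-operator API.

```go
package main

import "fmt"

// Prerequisites is a hypothetical summary of what the controller
// would need to detect before reporting Ready=True.
type Prerequisites struct {
	NHCInstalled bool // Node Health Check is installed
	SNRInstalled bool // Self Node Remediation is installed
	RBACApplied  bool // supplemental RBAC is applied
}

// Condition is a hypothetical, simplified status condition.
type Condition struct {
	Type   string
	Status string
	Reason string
}

// readyCondition reports Ready=True only when every prerequisite holds;
// otherwise it names the first missing prerequisite as the reason.
func readyCondition(p Prerequisites) Condition {
	switch {
	case !p.NHCInstalled:
		return Condition{Type: "Ready", Status: "False", Reason: "NHCNotInstalled"}
	case !p.SNRInstalled:
		return Condition{Type: "Ready", Status: "False", Reason: "SNRNotInstalled"}
	case !p.RBACApplied:
		return Condition{Type: "Ready", Status: "False", Reason: "RBACNotApplied"}
	default:
		return Condition{Type: "Ready", Status: "True", Reason: "PrerequisitesMet"}
	}
}

func main() {
	c := readyCondition(Prerequisites{NHCInstalled: true, SNRInstalled: true, RBACApplied: false})
	fmt.Printf("%s=%s (%s)\n", c.Type, c.Status, c.Reason)
}
```

Reporting the first missing prerequisite as the reason is a design choice of this sketch; it keeps the status actionable when reproducing the POC step by step.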
References
- Parent Epic: OSPRH-14880 (Handling non-graceful worker node shutdowns: SNR/NHC behavior and PVCs cleanup on LVMS with Galera & RabbitMQ).
- Slack discussion and design docs:
  - rhn-support-aromito docs (problem summary + approach): Doc 1, Doc 2.
  - SNR/NHC doc (historical reference): Doc.
  - rhn-support-lmiccini's proposal: PodRemediator CR in infra-operator (same pattern as BGP PR 322); medik8s SNR change.
- is triggered by: OSPRH-14880 Handling non-graceful worker node shutdowns: SNR/NHC behavior and PVCs cleanup on LVMS with Galera & RabbitMQ (Closed)