-
Story
-
Resolution: Unresolved
-
Undefined
-
rhos-18.0.17 FR 5
-
Feature Tracking
-
5
-
rhos-ops-platform-services-pidone
-
-
-
-
Sprint 17
-
1
Description:
Implement the PodRemediator feature in infra-operator.
This includes:
- CRD and API: Add the PodRemediator custom resource under remediation.openstack.org (e.g. v1beta1): types for Spec (e.g. namespaces to watch, EnablePVCRemediation) and Status (conditions), OpenAPI validation, defaulting, and CRD manifests.
- Controller logic: Implement the reconciler in internal/controller/remediation/ so that it:
  - Checks that Node Health Check (NHC) and Self Node Remediation (SNR) are present (e.g. via dynamic client on medik8s APIs); sets Ready=False with a clear message when they are missing.
  - Lists cluster nodes and determines which are unhealthy (e.g. NodeReady not True).
  - For each configured namespace, lists PVCs, resolves the bound PV, and determines if the PV is “local” (e.g. node affinity + Local / HostPath / CSI) and which node it is on (e.g. kubernetes.io/hostname, LVMS/TopoLVM topology keys).
  - For PVCs bound to a local PV on an unhealthy node: optionally annotates the PVC with a “pending-deletion” intent (so a restarted controller can resume), then deletes the PVC so the workload can be rescheduled.
  - Sets Ready=True when NHC/SNR are present and (if applicable) EnablePVCRemediation is true and there are no blocking errors; otherwise sets Ready=False with an appropriate reason/message.
- Wiring: Register the controller in main.go (scheme, client, dynamic client, manager), set up watches for PodRemediator, Node, PVC, and optionally Pod so reconciliation is triggered on relevant changes.
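The PVC selection step described under "Controller logic" could be sketched roughly as follows. This is a minimal illustration only, not the infra-operator implementation: the Node, PV, and PVC structs are simplified hypothetical stand-ins for the Kubernetes API types, the function names are invented for this example, and the TopoLVM topology key is an assumption that should be checked against the storage driver actually deployed.

```go
package main

import "fmt"

// Simplified, hypothetical stand-ins for the Kubernetes objects the
// reconciler would read. These are NOT the k8s.io/api types.
type Node struct {
	Name  string
	Ready bool // true only when the NodeReady condition is True
}

type PV struct {
	Name string
	// NodeAffinity maps a topology key to the single node value it pins to.
	NodeAffinity map[string]string
}

type PVC struct {
	Namespace  string
	Name       string
	VolumeName string // name of the bound PV, empty if unbound
}

// hostnameKeys lists the topology keys checked, in order. The TopoLVM key
// is an assumption for illustration.
var hostnameKeys = []string{"kubernetes.io/hostname", "topology.topolvm.io/node"}

// nodeForPV returns the node a PV is pinned to, or false if the PV carries
// none of the known topology keys and is therefore not treated as local.
func nodeForPV(pv PV) (string, bool) {
	for _, key := range hostnameKeys {
		if node, ok := pv.NodeAffinity[key]; ok && node != "" {
			return node, true
		}
	}
	return "", false
}

// pvcsToRemediate returns the PVCs bound to a local PV on an unhealthy node.
func pvcsToRemediate(nodes []Node, pvs []PV, pvcs []PVC) []PVC {
	unhealthy := map[string]bool{}
	for _, n := range nodes {
		if !n.Ready {
			unhealthy[n.Name] = true
		}
	}
	pvByName := map[string]PV{}
	for _, pv := range pvs {
		pvByName[pv.Name] = pv
	}
	var out []PVC
	for _, pvc := range pvcs {
		pv, bound := pvByName[pvc.VolumeName]
		if !bound {
			continue // unbound PVC or unknown PV: nothing to decide
		}
		if node, local := nodeForPV(pv); local && unhealthy[node] {
			out = append(out, pvc)
		}
	}
	return out
}

func main() {
	nodes := []Node{{Name: "worker-0", Ready: true}, {Name: "worker-1", Ready: false}}
	pvs := []PV{
		{Name: "pv-0", NodeAffinity: map[string]string{"kubernetes.io/hostname": "worker-0"}},
		{Name: "pv-1", NodeAffinity: map[string]string{"topology.topolvm.io/node": "worker-1"}},
	}
	pvcs := []PVC{
		{Namespace: "openstack", Name: "data-0", VolumeName: "pv-0"},
		{Namespace: "openstack", Name: "data-1", VolumeName: "pv-1"},
	}
	for _, pvc := range pvcsToRemediate(nodes, pvs, pvcs) {
		fmt.Printf("would delete %s/%s\n", pvc.Namespace, pvc.Name)
	}
}
```

Keeping the selection as a pure function over plain values would make it straightforward to unit test without a cluster; the annotate-then-delete step would sit on top of its result.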
Acceptance criteria:
- PodRemediator CRD and API types are implemented and generated (e.g. make generate); CRD can be applied to the cluster.
- Controller runs in the infra-operator manager in openstack-operators and reconciles PodRemediator CRs.
- When NHC and SNR are installed and supplemental RBAC is applied, the PodRemediator CR reaches status.conditions[type=Ready].status=True; when NHC/SNR are missing or the check fails, Ready is False with a clear message.
- When a node hosting a local PV becomes NotReady, the controller deletes PVCs bound to that node’s local PVs (with optional pending-deletion annotation for resume); when nodes are healthy, no PVCs are deleted.
- Local PV detection supports at least node-affinity-based local volumes (e.g. hostPath, local, CSI with node affinity) and common topology keys (e.g. hostname, LVMS/TopoLVM).
- RBAC is in place for nodes, PVs, PVCs, NHC/SNR APIs, and PodRemediator CRs; the controller has no broader permissions than needed.
- Unit or functional tests cover the main behaviour (e.g. NHC/SNR check, local PV detection, or reconciliation logic) as per project standards.
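The Ready condition behaviour in the acceptance criteria could be captured in a small decision function along these lines. This is a sketch under stated assumptions: the Condition struct is a hypothetical stand-in for the Kubernetes condition type, and the reason strings ("DependenciesMissing", "ReconcileError", "Ready") and message texts are invented for illustration, not taken from the project.

```go
package main

import (
	"fmt"
	"strings"
)

// Condition is a simplified, hypothetical stand-in for a status condition.
type Condition struct {
	Type    string
	Status  string
	Reason  string
	Message string
}

// readyCondition derives the Ready condition from the NHC/SNR presence
// checks and any blocking error hit during reconciliation.
// The reason strings are illustrative assumptions.
func readyCondition(nhcPresent, snrPresent bool, blockingErr error) Condition {
	c := Condition{Type: "Ready", Status: "False"}

	var missing []string
	if !nhcPresent {
		missing = append(missing, "Node Health Check (NHC)")
	}
	if !snrPresent {
		missing = append(missing, "Self Node Remediation (SNR)")
	}

	switch {
	case len(missing) > 0:
		c.Reason = "DependenciesMissing"
		c.Message = "missing required operators: " + strings.Join(missing, ", ")
	case blockingErr != nil:
		c.Reason = "ReconcileError"
		c.Message = blockingErr.Error()
	default:
		c.Status = "True"
		c.Reason = "Ready"
		c.Message = "NHC and SNR are present; no blocking errors"
	}
	return c
}

func main() {
	fmt.Printf("%+v\n", readyCondition(true, false, nil))
	fmt.Printf("%+v\n", readyCondition(true, true, nil))
}
```

A table-driven unit test over the three inputs would cover the NHC/SNR check cases listed above without needing a running cluster.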