Bug
Resolution: Unresolved
rhwa-25.3
Reproduced in roughly 50% of the runs of this test case. Attached are the NMO controller pod logs for a successful execution and a failed one, together with a must-gather tarball captured while the problem was being reproduced.
Issue description:
NHC triggers a remediation (TestRemediation)
[kni@provisionhost-0-0 ~]$ oc get nodehealthchecks.remediation.medik8s.io -n openshift-workload-availability nhc-node-lease-test -oyaml
apiVersion: remediation.medik8s.io/v1alpha1
kind: NodeHealthCheck
metadata:
  creationTimestamp: "2025-07-15T08:16:29Z"
  generation: 1
  name: nhc-node-lease-test
  resourceVersion: "1833017"
  uid: 7bab6b78-58a4-4484-a42d-aa2a762b6f55
spec:
  minHealthy: 30%
  remediationTemplate:
    apiVersion: test.medik8s.io/v1alpha1
    kind: TestRemediationTemplate
    name: test-remediation-template
    namespace: openshift-workload-availability
  selector:
    matchExpressions:
    - key: node-role.kubernetes.io/worker
      operator: Exists
  unhealthyConditions:
  - duration: 30s
    status: "False"
    type: Ready
  - duration: 30s
    status: Unknown
    type: Ready
status:
  conditions:
  - lastTransitionTime: "2025-07-15T08:16:29Z"
    message: No issues found, NodeHealthCheck is enabled.
    reason: NodeHealthCheckEnabled
    status: "False"
    type: Disabled
  healthyNodes: 2
  lastUpdateTime: "2025-07-15T08:17:34Z"
  observedNodes: 3
  phase: Remediating
  reason: NHC is remediating 1 nodes
  unhealthyNodes:
  - name: worker-0-2
    remediations:
    - resource:
        apiVersion: test.medik8s.io/v1alpha1
        kind: TestRemediation
        name: worker-0-2
        uid: a8800e73-e8c8-45b3-97b7-fa88b6e9ae57
      started: "2025-07-15T08:17:34Z"
      templateName: test-remediation-template
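For context, the TestRemediationTemplate referenced in the spec above presumably follows the usual Medik8s external remediation template shape; the spec.template.spec layout below is an assumption based on that pattern, not taken from this cluster:

apiVersion: test.medik8s.io/v1alpha1
kind: TestRemediationTemplate
metadata:
  name: test-remediation-template
  namespace: openshift-workload-availability
spec:
  template:
    spec: {}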
NHC takes a lease on the node while it is in the Remediating phase
[kni@provisionhost-0-0 ~]$ oc get leases.coordination.k8s.io -n medik8s-leases
NAME              HOLDER                                AGE
node-worker-0-2   NodeHealthCheck-nhc-node-lease-test   2m6s
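For reference, the Lease object behind that table row should look roughly like this (name, namespace and holder are taken from the output above; the remaining fields are omitted because their values depend on the NHC lease manager configuration):

apiVersion: coordination.k8s.io/v1
kind: Lease
metadata:
  name: node-worker-0-2
  namespace: medik8s-leases
spec:
  holderIdentity: NodeHealthCheck-nhc-node-lease-test
  # leaseDurationSeconds / renewTime not shown; values depend on the NHC lease manager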
An NMO (NodeMaintenance) CR is created while the node is still under remediation
[kni@provisionhost-0-0 ~]$ oc get nm -n openshift-workload-availability -oyaml
apiVersion: v1
items:
- apiVersion: nodemaintenance.medik8s.io/v1beta1
  kind: NodeMaintenance
  metadata:
    creationTimestamp: "2025-07-15T08:29:44Z"
    finalizers:
    - foregroundDeleteNodeMaintenance
    generation: 1
    name: node-maintenance-worker-0-2
    resourceVersion: "1838963"
    uid: 2ac16c25-69c0-4df8-ae6b-c3f9e0ddcaa8
  spec:
    nodeName: worker-0-2
    reason: Test node maintenance
  status:
    lastError: 'can''t update or invalidate the lease because it is held by different owner: NodeHealthCheck-nhc-node-lease-test'
    lastUpdate: "2025-07-15T08:29:55Z"
    phase: Running
    totalpods: 17
kind: List
metadata:
  resourceVersion: ""

[kni@provisionhost-0-0 ~]$ oc get nm -n openshift-workload-availability -oyaml
apiVersion: v1
items:
- apiVersion: nodemaintenance.medik8s.io/v1beta1
  kind: NodeMaintenance
  metadata:
    creationTimestamp: "2025-07-15T08:29:44Z"
    finalizers:
    - foregroundDeleteNodeMaintenance
    generation: 1
    name: node-maintenance-worker-0-2
    resourceVersion: "1839500"
    uid: 2ac16c25-69c0-4df8-ae6b-c3f9e0ddcaa8
  spec:
    nodeName: worker-0-2
    reason: Test node maintenance
  status:
    lastError: 'can''t update or invalidate the lease because it is held by different owner: NodeHealthCheck-nhc-node-lease-test'
    lastUpdate: "2025-07-15T08:31:18Z"
    phase: Running
    totalpods: 17
kind: List
metadata:
  resourceVersion: ""
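The NodeMaintenance CR above was presumably created from a manifest along these lines, reconstructed from the spec fields shown in the output (a sketch, not the exact manifest used by the test):

apiVersion: nodemaintenance.medik8s.io/v1beta1
kind: NodeMaintenance
metadata:
  name: node-maintenance-worker-0-2
spec:
  nodeName: worker-0-2
  reason: Test node maintenance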
Remediation finishes
NHC removes the lease
[kni@provisionhost-0-0 ~]$ oc get leases.coordination.k8s.io -n medik8s-leases
No resources found in medik8s-leases namespace.
NMO should now take the lease on the node and cordon it and evict the pods running on it. Instead, the NodeMaintenance CR remains stuck with the lease-ownership error even though the lease has been removed:
[kni@provisionhost-0-0 ~]$ oc get nm -n openshift-workload-availability -oyaml
apiVersion: v1
items:
- apiVersion: nodemaintenance.medik8s.io/v1beta1
  kind: NodeMaintenance
  metadata:
    creationTimestamp: "2025-07-15T08:29:44Z"
    finalizers:
    - foregroundDeleteNodeMaintenance
    generation: 1
    name: node-maintenance-worker-0-2
    resourceVersion: "1839500"
    uid: 2ac16c25-69c0-4df8-ae6b-c3f9e0ddcaa8
  spec:
    nodeName: worker-0-2
    reason: Test node maintenance
  status:
    lastError: 'can''t update or invalidate the lease because it is held by different owner: NodeHealthCheck-nhc-node-lease-test'
    lastUpdate: "2025-07-15T08:31:18Z"
    phase: Running
    totalpods: 17
kind: List
metadata:
  resourceVersion: ""
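For comparison, on the runs where the test passes, NMO is expected to end up holding the node lease and cordoning the node, roughly like this (illustrative sketch only: the NodeMaintenance holder-identity string is assumed to follow the same Kind-name pattern seen above for NHC, and the AGE/VERSION values are placeholders):

[kni@provisionhost-0-0 ~]$ oc get leases.coordination.k8s.io -n medik8s-leases
NAME              HOLDER                                        AGE
node-worker-0-2   NodeMaintenance-node-maintenance-worker-0-2   1m

[kni@provisionhost-0-0 ~]$ oc get node worker-0-2
NAME         STATUS                     ROLES    AGE   VERSION
worker-0-2   Ready,SchedulingDisabled   worker   ...   ...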