Red Hat Workload Availability / RHWA-173

[NMO] Controller not aware of node lease updates


    • Type: Bug
    • Resolution: Unresolved
    • Priority: Undefined
    • rhwa-25.3
      Steps to reproduce:

      run pytest edge_tests/management/cluster_life_cycle/test_node_lease.py::TestNodeLease::test_nhc_node_lease_on_worker

      Reproduced in roughly 50% of the runs of this test case. Attached are the NMO controller pod logs for a successful execution and a failed one, together with a must-gather tar file captured while the problem is being reproduced.
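
      For reference, the attached logs and must-gather were presumably collected along these lines (the controller deployment name is an assumption; adjust it to the actual NMO controller deployment in the cluster):

      $ oc logs -n openshift-workload-availability deploy/node-maintenance-operator-controller-manager \
          --all-containers > nmo-controller.log            # deployment name assumed
      $ oc adm must-gather --dest-dir=./must-gather-nmo    # default image; a product-specific must-gather image may be needed
      $ tar czf must-gather-nmo.tar.gz ./must-gather-nmo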

      Issue description:

      NHC triggers a remediation (TestRemediation)

      [kni@provisionhost-0-0 ~]$ oc get nodehealthchecks.remediation.medik8s.io -n openshift-workload-availability nhc-node-lease-test -oyaml
      apiVersion: remediation.medik8s.io/v1alpha1
      kind: NodeHealthCheck
      metadata:
        creationTimestamp: "2025-07-15T08:16:29Z"
        generation: 1
        name: nhc-node-lease-test
        resourceVersion: "1833017"
        uid: 7bab6b78-58a4-4484-a42d-aa2a762b6f55
      spec:
        minHealthy: 30%
        remediationTemplate:
          apiVersion: test.medik8s.io/v1alpha1
          kind: TestRemediationTemplate
          name: test-remediation-template
          namespace: openshift-workload-availability
        selector:
          matchExpressions:
          - key: node-role.kubernetes.io/worker
            operator: Exists
        unhealthyConditions:
        - duration: 30s
          status: "False"
          type: Ready
        - duration: 30s
          status: Unknown
          type: Ready
      status:
        conditions:
        - lastTransitionTime: "2025-07-15T08:16:29Z"
          message: No issues found, NodeHealthCheck is enabled.
          reason: NodeHealthCheckEnabled
          status: "False"
          type: Disabled
        healthyNodes: 2
        lastUpdateTime: "2025-07-15T08:17:34Z"
        observedNodes: 3
        phase: Remediating
        reason: NHC is remediating 1 nodes
        unhealthyNodes:
        - name: worker-0-2
          remediations:
          - resource:
              apiVersion: test.medik8s.io/v1alpha1
              kind: TestRemediation
              name: worker-0-2
              uid: a8800e73-e8c8-45b3-97b7-fa88b6e9ae57
            started: "2025-07-15T08:17:34Z"
            templateName: test-remediation-template 

      NHC holds a lease on the node while it is in Remediating status

      [kni@provisionhost-0-0 ~]$ oc get leases.coordination.k8s.io -n medik8s-leases 
      NAME              HOLDER                                AGE
      node-worker-0-2   NodeHealthCheck-nhc-node-lease-test   2m6s 
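
      Dumping the Lease object itself shows the holder identity that NMO later complains about (a sketch; the fields noted below are standard coordination.k8s.io/v1 Lease spec fields, their exact values were not captured here):

      $ oc get lease node-worker-0-2 -n medik8s-leases -o yaml
      # spec.holderIdentity:                        NodeHealthCheck-nhc-node-lease-test
      # spec.leaseDurationSeconds / spec.renewTime: what a would-be new owner checks before taking the lease over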

      An NMO (NodeMaintenance) CR is created while the node is still under remediation; it stays in the Running phase and repeatedly reports that the lease is held by a different owner:

      [kni@provisionhost-0-0 ~]$ oc get nm -n openshift-workload-availability -oyaml
      apiVersion: v1
      items:
      - apiVersion: nodemaintenance.medik8s.io/v1beta1
        kind: NodeMaintenance
        metadata:
          creationTimestamp: "2025-07-15T08:29:44Z"
          finalizers:
          - foregroundDeleteNodeMaintenance
          generation: 1
          name: node-maintenance-worker-0-2
          resourceVersion: "1838963"
          uid: 2ac16c25-69c0-4df8-ae6b-c3f9e0ddcaa8
        spec:
          nodeName: worker-0-2
          reason: Test node maintenance
        status:
          lastError: 'can''t update or invalidate the lease because it is held by different
            owner: NodeHealthCheck-nhc-node-lease-test'
          lastUpdate: "2025-07-15T08:29:55Z"
          phase: Running
          totalpods: 17
      kind: List
      metadata:
        resourceVersion: ""
      [kni@provisionhost-0-0 ~]$ oc get nm -n openshift-workload-availability -oyaml
      apiVersion: v1
      items:
      - apiVersion: nodemaintenance.medik8s.io/v1beta1
        kind: NodeMaintenance
        metadata:
          creationTimestamp: "2025-07-15T08:29:44Z"
          finalizers:
          - foregroundDeleteNodeMaintenance
          generation: 1
          name: node-maintenance-worker-0-2
          resourceVersion: "1839500"
          uid: 2ac16c25-69c0-4df8-ae6b-c3f9e0ddcaa8
        spec:
          nodeName: worker-0-2
          reason: Test node maintenance
        status:
          lastError: 'can''t update or invalidate the lease because it is held by different
            owner: NodeHealthCheck-nhc-node-lease-test'
          lastUpdate: "2025-07-15T08:31:18Z"
          phase: Running
          totalpods: 17
      kind: List
      metadata:
        resourceVersion: ""
      [kni@provisionhost-0-0 ~]$ oc get nm -n openshift-workload-availability -oyaml
      apiVersion: v1
      items:
      - apiVersion: nodemaintenance.medik8s.io/v1beta1
        kind: NodeMaintenance
        metadata:
          creationTimestamp: "2025-07-15T08:29:44Z"
          finalizers:
          - foregroundDeleteNodeMaintenance
          generation: 1
          name: node-maintenance-worker-0-2
          resourceVersion: "1839500"
          uid: 2ac16c25-69c0-4df8-ae6b-c3f9e0ddcaa8
        spec:
          nodeName: worker-0-2
          reason: Test node maintenance
        status:
          lastError: 'can''t update or invalidate the lease because it is held by different
            owner: NodeHealthCheck-nhc-node-lease-test'
          lastUpdate: "2025-07-15T08:31:18Z"
          phase: Running
          totalpods: 17
      kind: List
      metadata:
        resourceVersion: ""
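
      For completeness, the NodeMaintenance CR above can be created from a manifest that mirrors the spec in the output (a minimal sketch; the namespace is omitted to match the metadata shown above):

      # node-maintenance.yaml (mirrors the spec shown above)
      apiVersion: nodemaintenance.medik8s.io/v1beta1
      kind: NodeMaintenance
      metadata:
        name: node-maintenance-worker-0-2
      spec:
        nodeName: worker-0-2
        reason: Test node maintenance

      $ oc apply -f node-maintenance.yaml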

      Remediation finishes

      NHC removes the lease

      [kni@provisionhost-0-0 ~]$ oc get leases.coordination.k8s.io -n medik8s-leases 
      No resources found in medik8s-leases namespace.

      At this point NMO should take the lease on the node and proceed to cordon it and evict the pods running on it. Instead, the NodeMaintenance CR keeps reporting the stale lease error:

      [kni@provisionhost-0-0 ~]$ oc get nm -n openshift-workload-availability -oyaml
      apiVersion: v1
      items:
      - apiVersion: nodemaintenance.medik8s.io/v1beta1
        kind: NodeMaintenance
        metadata:
          creationTimestamp: "2025-07-15T08:29:44Z"
          finalizers:
          - foregroundDeleteNodeMaintenance
          generation: 1
          name: node-maintenance-worker-0-2
          resourceVersion: "1839500"
          uid: 2ac16c25-69c0-4df8-ae6b-c3f9e0ddcaa8
        spec:
          nodeName: worker-0-2
          reason: Test node maintenance
        status:
          lastError: 'can''t update or invalidate the lease because it is held by different
            owner: NodeHealthCheck-nhc-node-lease-test'
          lastUpdate: "2025-07-15T08:31:18Z"
          phase: Running
          totalpods: 17
      kind: List
      metadata:
        resourceVersion: ""
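
      The lastError above never changes even though the lease has already been removed, which matches the title of this bug: the NMO controller does not appear to get (or act on) an event when the lease is updated or released, so the NodeMaintenance CR never progresses on its own. A hedged way to confirm that from the outside is to force a reconcile and watch the CR move on; the annotation key below is arbitrary and the deployment name is an assumption:

      # a metadata change may trigger a reconcile, depending on the controller's event filters
      $ oc annotate nodemaintenance node-maintenance-worker-0-2 -n openshift-workload-availability \
          medik8s.io/debug-touch="$(date +%s)" --overwrite
      # restarting the controller forces a full resync (deployment name assumed)
      $ oc rollout restart -n openshift-workload-availability \
          deployment/node-maintenance-operator-controller-manager
      $ oc get nm node-maintenance-worker-0-2 -n openshift-workload-availability -o jsonpath='{.status.phase}'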

              Assignee: Or Raz (oraz@redhat.com)
              Reporter: Francisco Javier Moreno Moreno (frmoreno)