OpenShift Request For Enhancement / RFE-7959

Ability to pause/disable the Node Health Check Operator during planned maintenance in RHOCP 4

      1. Proposed title of this feature request
      —>
      Ability to pause or temporarily disable the Node Health Check Operator during planned maintenance in an RHOCP cluster.

      2. What is the nature and description of the request?
      —>
      The customer is using node-healthcheck-operator.v0.9.0 on OpenShift 4.
      During a planned rolling restart of OpenShift nodes (for maintenance), the nodes temporarily entered the NotReady state, which is expected during reboots.

      However, the Node Health Check (NHC) operator interpreted this as a node failure and responded by:

      • Triggering an additional reboot of the node.
      • Applying taints to prevent pod scheduling.

      This behavior is undesirable during planned maintenance events, where such node transitions are expected and controlled.

      We request an enhancement that allows the NHC operator to be temporarily disabled or paused during planned maintenance. Specifically, we propose:

      • A way to inform or signal the NHC operator that a node is undergoing planned maintenance, so it should not take any remediation action.
      • This could be implemented via:
            - A node annotation or label (e.g., maintenance=true)
            - A field in the NodeHealthCheck CRD to pause or temporarily disable remediation
            - Integration with known maintenance workflows or cordon/drain tools
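The three options proposed above could combine into a single skip-remediation check. The following is a minimal sketch of that decision logic only, not the operator's actual implementation: the `Node` class, the `should_remediate` function, the `paused` flag, and the `maintenance` key are all hypothetical names chosen for illustration, with the key taken from the `maintenance=true` example in the request.

```python
from dataclasses import dataclass, field

# Hypothetical key: the request only suggests "maintenance=true" as an example.
MAINTENANCE_KEY = "maintenance"


@dataclass
class Node:
    """Simplified stand-in for a cluster node (an assumption, not a real API type)."""
    name: str
    ready: bool
    unschedulable: bool = False  # True when the node has been cordoned
    labels: dict = field(default_factory=dict)
    annotations: dict = field(default_factory=dict)


def should_remediate(node: Node, paused: bool = False) -> bool:
    """Return True only for an unhealthy node that is not under planned maintenance."""
    if node.ready:
        return False  # healthy node, nothing to remediate
    if paused:
        return False  # hypothetical pause flag on the health check resource
    marker = node.labels.get(MAINTENANCE_KEY) or node.annotations.get(MAINTENANCE_KEY)
    if marker == "true":
        return False  # node explicitly marked as under maintenance
    if node.unschedulable:
        return False  # cordoned node, treated here as a maintenance signal
    return True


rebooting = Node("worker-1", ready=False, labels={MAINTENANCE_KEY: "true"})
failed = Node("worker-2", ready=False)
print(should_remediate(rebooting), should_remediate(failed))
```

Treating a cordoned node as "under maintenance" is a design choice in this sketch. It would also suppress remediation for a cordoned node that has genuinely failed, so an explicit marker or pause flag may be the safer signal.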

      3. Why does the customer need this? (List the business requirements here)
      —>

      • To prevent unintended disruption during maintenance windows.
      • To reduce unnecessary node reboots and tainting.
      • To improve operational control for administrators.
      • To align NHC behavior with cluster management best practices.

      4. List any affected packages or components.
      —>

      • node-healthcheck-operator
      • MachineHealthCheck
      • machine-api-operator

              racedoro@redhat.com Ramon Acedo
              rhn-support-sdharma Suruchi Dharma