-
Feature Request
-
Resolution: Unresolved
-
Major
-
4.16, 4.18, 4.17
-
Product / Portfolio Work
-
False
1. Proposed title of this feature request
—>
Add the ability to pause or temporarily disable the Node Health Check Operator during planned maintenance in an RHOCP cluster.
2. What is the nature and description of the request?
—>
The customer is using node-healthcheck-operator.v0.9.0 on OpenShift 4.
During a planned rolling restart of OpenShift nodes (for maintenance), the nodes temporarily entered the NotReady state, which is expected during reboots.
However, the Node Health Check (NHC) operator interpreted this as a node failure and responded by:
- Triggering an additional reboot of the node.
- Applying taints to prevent pod scheduling.
This behavior is undesirable during planned maintenance events, where such node transitions are expected and controlled.
We request an enhancement that allows the NHC operator to be temporarily disabled or paused during planned maintenance. Specifically, we propose:
- A way to signal to the NHC operator that a node is undergoing planned maintenance, so that it takes no remediation action on that node.
- This could be implemented via:
  - A node annotation or label (e.g., maintenance=true)
  - A field in the NodeHealthCheck CRD to pause or disable temporarily
  - Integration with known maintenance workflows or cordon/drain tools.
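To make the proposal concrete, the decision it describes can be sketched as a small predicate. This is only an illustration of the requested behavior, not the operator's actual code: the `maintenance` annotation/label key comes from the example above, while the `paused` and `respect_cordon` settings and all class and function names are hypothetical.

```python
from dataclasses import dataclass, field

# Hypothetical key taken from the "maintenance=true" example in this request;
# it is not an existing NHC API.
MAINTENANCE_KEY = "maintenance"


@dataclass
class Node:
    """Minimal stand-in for the node state a health check would inspect."""
    name: str
    ready: bool
    unschedulable: bool = False  # True once the node has been cordoned
    annotations: dict = field(default_factory=dict)
    labels: dict = field(default_factory=dict)


@dataclass
class HealthCheckConfig:
    """Hypothetical settings modelling the options proposed above."""
    paused: bool = False          # resource-level pause switch
    respect_cordon: bool = False  # treat cordoned nodes as under maintenance


def in_planned_maintenance(node: Node, config: HealthCheckConfig) -> bool:
    """Return True if the node is marked or cordoned for maintenance."""
    marked = (
        node.annotations.get(MAINTENANCE_KEY) == "true"
        or node.labels.get(MAINTENANCE_KEY) == "true"
    )
    cordoned = config.respect_cordon and node.unschedulable
    return marked or cordoned


def should_remediate(node: Node, config: HealthCheckConfig) -> bool:
    """Remediate only unhealthy nodes that are not under planned maintenance."""
    if config.paused:
        return False
    if node.ready:
        return False
    return not in_planned_maintenance(node, config)
```

With this logic, a NotReady node carrying `maintenance=true` would be left alone during a rolling restart, while an unmarked NotReady node would still be remediated as it is today.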
3. Why does the customer need this? (List the business requirements here)
—>
- To prevent unintended disruption during maintenance windows.
- To reduce unnecessary node reboots and tainting.
- To improve operational control for administrators.
- To align NHC behavior with cluster management best practices.
4. List any affected packages or components.
—>
- node-healthcheck-operator
- MachineHealthCheck
- machine-api-operator