Type: Feature Request
Resolution: Unresolved
Priority: Major
Fix Version/s: None
Affects Version/s: 4.16, 4.17, 4.18
Component/s: Node Health Check Operator
Labels:

Target Version: None
Activity Type: Product / Portfolio Work
Status Summary: None
Blocked: False
Blocked Reason: None
Products: None
Hierarchy Progress Bar: None

SFDC Cases Counter:
SFDC Cases Open:
SFDC Cases Links:

PX Review Complete: None
PX Impact Score:
PX Impact Range: None
PX Priority Data: None
PX Technical Impact: None
PX Technical Impact Notes: None
PX Scheduling Request: None

1. Proposed title of this feature request
—>
Ability to pause or temporarily disable the Node Health Check Operator during planned maintenance in an RHOCP cluster.

2. What is the nature and description of the request?
—>
The customer is using node-healthcheck-operator.v0.9.0 on OpenShift 4.
During a planned rolling restart of OpenShift nodes (for maintenance), the nodes temporarily entered the NotReady state, which is expected during reboots.

However, the Node Health Check (NHC) operator interpreted this as a node failure and responded by:

- Triggering an additional reboot of the node.
- Applying taints to prevent pod scheduling.

This behavior is undesirable during planned maintenance events, where such node transitions are expected and controlled.

We request a feature enhancement that allows the NHC operator to be temporarily disabled or paused during planned maintenance. Specifically, we propose a way to signal to the NHC operator that a node is undergoing planned maintenance, so that it takes no remediation action on that node.

This could be implemented via:
- A node annotation or label (e.g., maintenance=true).
- A field in the NodeHealthCheck CRD that pauses or temporarily disables remediation.
- Integration with known maintenance workflows or cordon/drain tools.
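
The decision logic behind these three options can be sketched as follows. This is only an illustration of the requested behavior, not the operator's actual code: the `maintenance` annotation key comes from the example above, while the `paused` and `respect_cordon` settings, the `Node` and `HealthCheckConfig` types, and the `should_remediate` function are hypothetical names introduced here.

```python
from dataclasses import dataclass, field

# Annotation key taken from the "maintenance=true" example in this request.
# It is a proposal, not an existing NHC API.
MAINTENANCE_KEY = "maintenance"


@dataclass
class Node:
    """Minimal stand-in for the node state the operator would inspect."""
    name: str
    ready: bool
    annotations: dict = field(default_factory=dict)
    unschedulable: bool = False  # True when the node has been cordoned


@dataclass
class HealthCheckConfig:
    """Hypothetical settings on the NodeHealthCheck resource."""
    paused: bool = False          # proposed pause/disable field
    respect_cordon: bool = False  # proposed cordon/drain integration


def should_remediate(node: Node, config: HealthCheckConfig) -> bool:
    """Return True only if the node is unhealthy and no maintenance signal applies."""
    if node.ready:
        return False  # healthy node, nothing to do
    if config.paused:
        return False  # remediation paused for the whole health check
    if node.annotations.get(MAINTENANCE_KEY) == "true":
        return False  # node explicitly marked as under maintenance
    if config.respect_cordon and node.unschedulable:
        return False  # cordoned node treated as planned maintenance
    return True
```

With a check like this, a node that goes NotReady during a planned reboot would be skipped as long as any one of the three signals is set, and normal remediation would resume once the signal is removed.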

3. Why does the customer need this? (List the business requirements here)
—>

- To prevent unintended disruption during maintenance windows.
- To reduce unnecessary node reboots and tainting.
- To improve operational control for administrators.
- To align NHC behavior with cluster management best practices.

4. List any affected packages or components.
—>

- node-healthcheck-operator
- MachineHealthCheck
- machine-api-operator

Assignee: Ramon Acedo

Reporter: Suruchi Dharma

Need Info From: None

Votes: 0

Watchers: 1

Created: 2025/07/31 7:19 PM

Updated: 2025/10/10 7:40 PM

Target start: None

Target end: None
