Type: Feature Request
Resolution: Unresolved
Priority: Undefined
Fix Version/s: None
Affects Version/s: openshift-4.19
Component/s: Node
Labels:
- node
- rfe

Target Version:
None
Activity Type:
Product / Portfolio Work
Status Summary:
None
Blocked:
False
Blocked Reason:
None
Products:
None
Hierarchy Progress Bar:
None

SFDC Cases Counter:
SFDC Cases Open:
SFDC Cases Links:

PX Review Complete:
None
PX Impact Score:
PX Impact Range:
None
PX Priority Data:
None
PX Technical Impact:
None
PX Technical Impact Notes:
None
PX Scheduling Request:
None

1. Proposed title of this feature request

Add a new worker latency profile LowEvictionLatency

2. What is the nature and description of the request?

Add a new worker latency profile, for example LowEvictionLatency, with the following tuning values:

- default-unreachable-toleration-seconds: 40
- All other parameters (including node-monitor-grace-period) remain consistent with the Default latency profile.
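
To make the requested delta concrete, here is a minimal Python sketch that derives the proposed profile from Default by overriding a single value, and builds a merge-patch body that would select it. The Default values shown are assumptions taken from the OpenShift worker latency profile documentation and should be verified for the target release. `LowEvictionLatency` is the name proposed in this request and is not an accepted value today, and `node_config_patch` is a hypothetical helper.

```python
import json

# Assumed Default-profile values (from the OpenShift worker latency profile
# documentation; verify against the cluster version in use).
DEFAULT_PROFILE = {
    "node-status-update-frequency": "10s",
    "node-monitor-grace-period": "40s",
    "default-not-ready-toleration-seconds": 300,
    "default-unreachable-toleration-seconds": 300,
}

# Proposed profile from this request: identical to Default except for the
# unreachable toleration, which drops from 300 to 40 seconds.
LOW_EVICTION_LATENCY = {
    **DEFAULT_PROFILE,
    "default-unreachable-toleration-seconds": 40,
}


def node_config_patch(profile_name: str) -> str:
    """Hypothetical helper: JSON merge-patch body selecting a latency profile."""
    return json.dumps({"spec": {"workerLatencyProfile": profile_name}})


# Show exactly which parameters differ from Default.
changed = {
    key: value
    for key, value in LOW_EVICTION_LATENCY.items()
    if DEFAULT_PROFILE[key] != value
}
print(changed)
print(node_config_patch("LowEvictionLatency"))
```

If the profile were implemented, a body like this could in principle be applied as a merge patch to the cluster-scoped Node config object, in the same way the existing profiles are selected through `spec.workerLatencyProfile`.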

3. Why does the customer need this? (List the business requirements here)

StatefulSet workloads are particularly sensitive to delays in pod eviction when a node becomes unreachable. In OpenShift, the current default-unreachable-toleration-seconds value of 300 seconds (5 minutes) causes significant delays in failover for StatefulSet-based applications.

For example, in high-availability configurations using ActiveMQ Broker (AMQ) with leader/follower roles, a sudden node failure results in the follower not assuming leadership until the leader pod is fully evicted—a process currently blocked by the long toleration period. This impacts message availability and system responsiveness.

The root cause of the delay is tied to both Kubernetes scheduling logic and storage-level resource locks (e.g., CephFS file locks remaining held due to stale sessions). While storage configuration changes may mitigate the issue, they often involve trade-offs or limitations (e.g., abandoning ODF).

Justification / Use Case:

- Provides a tuned environment specifically for StatefulSet workloads requiring faster failover.
- Reduces failover times from 5+ minutes to under 1 minute in case of sudden node failure.
- Preserves the existing profiles (Default, MediumUpdateAverageReaction, LowUpdateSlowReaction) without impacting current users.
- Avoids complex workarounds such as controller type changes or storage migration.
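
As a rough check on the timing, the sketch below adds the node failure detection window to the unreachable toleration. It assumes a node-monitor-grace-period of 40 seconds (the documented Default-profile value, not stated in this request) and that pods become eligible for eviction once the toleration expires after the node is marked unreachable. The function name is hypothetical, and real failover also includes rescheduling and storage lock release, which are not modeled here.

```python
# Assumption: the Default profile's node-monitor-grace-period is 40 seconds.
ASSUMED_GRACE_PERIOD_S = 40


def seconds_until_eviction(grace_period_s: int, unreachable_toleration_s: int) -> int:
    """Approximate time from the last node heartbeat until pods are evicted.

    The node is tainted as unreachable after the grace period elapses, and
    pods are evicted once their toleration for that taint expires.
    """
    return grace_period_s + unreachable_toleration_s


current = seconds_until_eviction(ASSUMED_GRACE_PERIOD_S, 300)
proposed = seconds_until_eviction(ASSUMED_GRACE_PERIOD_S, 40)

print(f"Default profile:    ~{current} s until eviction")
print(f"LowEvictionLatency: ~{proposed} s until eviction")
print(f"Reduction:          ~{current - proposed} s")
```

Under these assumptions the toleration window itself drops to 40 seconds, while detection plus toleration comes to roughly 80 seconds, compared with about 340 seconds today.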

4. List any affected packages or components.

OCP
node.config.openshift.io{}

See: Worker latency profiles

Assignee: Gaurav Singh

Reporter: Alberto Gonzalez de Dios

Need Info From: None

Votes: 0

Watchers: 3

Created: 2025/07/17 11:46 AM

Updated: 2025/08/20 3:19 AM

Target start: None

Target end: None
