Type: Story
Resolution: Done
Priority: Major
Fix Version/s: rhwa-25.9
Affects Version/s: None
Component/s: Self Node Remediation
Labels: None
Blocked: False
Blocked Reason: None
Ready: False
Release Note Text:
Cause: The SNR manager runs on a node that becomes unhealthy, so it is evicted.
Consequence: The SNR manager must be rescheduled on a healthy node before it can process the remediation CR, which delays remediation.
Fix: Run the SNR manager deployment with 2 replicas. The replicas are not co-located on the same node because the deployment includes a topologySpreadConstraints entry with maxSkew: 1, topologyKey: kubernetes.io/hostname, and whenUnsatisfiable: DoNotSchedule.
Result: When an SNR manager is evicted from a faulty node, the SNR manager replica on a healthy node can take control and process the CR, which reduces the SNR delay.
Release Note Type: Feature
Release Note Status: Proposed
Intelligence Requested:
Market:
Target Version: rhwa-25.9
SFDC Cases Links:
SFDC Cases Open:
SFDC Cases Counter:

SNR node remediation can take a long time when a node becomes unhealthy while the self-node-remediation-controller-manager is running on that node.

This delay occurs as of the following commit: https://github.com/medik8s/self-node-remediation/commit/748e5742284dd672d91a553a205496189288b592

This patch changes the behavior as follows:

[Before merging this patch]
The SNR daemonset pod on each node tries to add/delete the out-of-service taint when an SNR object is created.

[After merging this patch]
Only the self-node-remediation-controller-manager pod and the SNR daemonset pod on the failed node try to add/delete the out-of-service taint when an SNR object is created.

The failed node cannot be trusted, so only the self-node-remediation-controller-manager is able to remediate it. As a result, when the manager itself is being evicted, node remediation is delayed.

To avoid this remediation delay, the manager should run with 2 replicas.
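
A two-replica manager deployment with the spread constraint named in the release note could look roughly like the fragment below. This is only a sketch, not the manifest shipped by the operator: the `app: snr-manager` label, the container name, and the image are hypothetical placeholders. Only the deployment name, the replica count, and the three topologySpreadConstraints values come from this issue.

```yaml
# Illustrative fragment only; labels, container name, and image are hypothetical.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: self-node-remediation-controller-manager
spec:
  replicas: 2                        # two manager replicas instead of one
  selector:
    matchLabels:
      app: snr-manager               # hypothetical label
  template:
    metadata:
      labels:
        app: snr-manager             # hypothetical label
    spec:
      topologySpreadConstraints:
        - maxSkew: 1
          topologyKey: kubernetes.io/hostname
          whenUnsatisfiable: DoNotSchedule
          labelSelector:
            matchLabels:
              app: snr-manager       # must match the pod template labels
      containers:
        - name: manager              # hypothetical container name
          image: example.com/snr-manager:latest   # placeholder image
```

With these values, placing both replicas on one node would give a skew of 2 against any other eligible node running none, which exceeds maxSkew: 1. Because whenUnsatisfiable is DoNotSchedule, the scheduler would leave the second replica Pending rather than co-locate it.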


RHWA-363-4.20-connected-RHOCP-4.20-Nov-14.text
7 kB
2025/11/13 7:51 PM

causes

RHWA-366 Investigate update strategy with topologySpreadConstraints

relates to

RHWA-364 [FAR] Improve HA by using 'topologySpreadConstraints' to enforce strict pod distribution for fence-agents-controller-manager replicas

Review

RHWA-365 [NHC] Improve HA by using 'topologySpreadConstraints' to enforce strict pod distribution for node-healthcheck-controller-manager replicas

Review

links to

medik8s/self-node-remediation#180: Run manager with 2 replicas

mentioned on

Merge request - TELCODOCS-2544: RHWA 25.9 Release Notes first draft / Common Attributes updated

Assignee: Keiichi Kii

Reporter: Keiichi Kii

Votes: 0

Watchers: 9

Created: 2024/01/27 1:14 AM

Updated: 2025/11/24 8:35 AM

Resolved: 2025/11/17 9:08 PM
