Type: Bug
Resolution: Done
Priority: Critical
Fix Version/s: rhwa-25.3
Affects Version/s: None
Component/s: Node Healthcheck, Self Node Remediation
Labels:
None

Blocked:
False
Blocked Reason:

None
Ready:
False
Release Note Text:

When the API server experienced issues, the Self Node Remediation (SNR) Operator might trigger unnecessary reboots of the nodes. (ECOPROJECT-2642)

1) Cause: When the API server experienced issues, the SNR peer check might fail. A request timeout on the peer server side did not always work as expected, so the peer’s response was sent too late. As a result, the peer client on the node might conclude that the node was isolated and trigger a reboot.
2) Consequence: Nodes were unnecessarily rebooted when the API server experienced issues.
3) Fix: The timeout handling on the peer check server was enhanced.
4) Result: API server issues no longer result in unnecessary node reboots.

Release Note Type:
Bug Fix
Release Note Status:
In Progress
Intelligence Requested:
Market:

Target Version:

rhwa-25.3

SFDC Cases Links:
SFDC Cases Open:
SFDC Cases Counter:

During a PoC we were trying to prove a cluster restore, and it failed because of a reboot storm caused by SNR.

Cluster design/size:

3 schedulable control plane nodes
4 worker nodes

Versions: OpenShift v4.17

Details follow with must-gather

Our procedure:

Power off / hard shutdown 2 of 3 control plane nodes
=> API is not available any more, as expected
Start the restore based on our documentation: 5.3.2. Restoring to a previous cluster state
Etcd backup
Start cluster-restore.sh
SNR kicked in and rebooted the master node and all other nodes.

=> Test failed!


Attachments:

snr_logs.tar (190 kB, 2025/05/26 6:57 PM)
a157691013150f9.pcap (860 kB, 2025/05/21 3:05 AM)
self-node-remediation-ds-9p49k.log (59 kB, 2025/05/21 3:05 AM)

links to

medik8s/self-node-remediation#262: Fix health server timeout

medik8s/self-node-remediation#265: [release-0.10] Fix health server timeout

medik8s/self-node-remediation#266: [release-0.9] Fix health server timeout

RHEA-2024:136700 Self Node Remediation 0.9.1

Assignee: Marc Sluiter

Reporter: Robert Bohne

Contributors: Joachim von Thadden, Marc Schindler

Votes: 2

Watchers: 25

Created: 2025/01/31 12:46 PM

Updated: 2025/10/22 8:40 AM

Resolved: 2025/07/08 10:38 AM
