Type: Bug
Resolution: Done
Priority: Major
Fix Version/s: rhwa-25.3, rhwa-25.2
Affects Version/s: rhwa-25.1, rhwa-24.3, rhwa-23.3
Component/s: Self Node Remediation
Labels:
None

Blocked:
False
Blocked Reason:

None
Ready:
True
Release Note Text:

Previously, when a node had Self Node Remediation (SNR) and Node Health Check (NHC) enabled, the node was not rebooted when its network was shut down. (RHWA-13)

1) Cause: An empty Pod IP was included in the list of peers contacted for the api-check connectivity status.
2) Consequence: The api-check request was sent to localhost (the default when the IP is empty), so the pod contacted itself.
3) Fix: Empty Pod IPs are now excluded from the list of peers to be checked.
4) Result: The pod contacts only other peers to verify connectivity.
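
The fix described above can be sketched in Go as follows. This is a minimal illustration, not the code from the linked pull request: the `Peer` type, the `filterReachablePeers` function, and the sample node names and addresses are all hypothetical.

```go
package main

import "fmt"

// Peer is a hypothetical stand-in for a peer pod entry; the real
// self-node-remediation types are not shown in this issue.
type Peer struct {
	NodeName string
	PodIP    string
}

// filterReachablePeers returns only the peers that have a non-empty Pod IP,
// so that an empty address is never used as an api-check target.
func filterReachablePeers(peers []Peer) []Peer {
	kept := make([]Peer, 0, len(peers))
	for _, p := range peers {
		if p.PodIP == "" {
			// No Pod IP assigned yet: skip instead of contacting localhost.
			continue
		}
		kept = append(kept, p)
	}
	return kept
}

func main() {
	// Sample data invented for illustration.
	peers := []Peer{
		{NodeName: "worker-0", PodIP: "10.128.2.15"},
		{NodeName: "worker-1", PodIP: ""},
		{NodeName: "worker-2", PodIP: "10.129.0.7"},
	}
	for _, p := range filterReachablePeers(peers) {
		fmt.Printf("%s %s\n", p.NodeName, p.PodIP)
	}
}
```

With the sample data, only `worker-0` and `worker-2` remain in the list, so every remaining api-check target is a real peer address.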

Release Note Type:
Bug Fix
Release Note Status:
Done
Intelligence Requested:
Market:

Severity:
Critical

SFDC Cases Links:
SFDC Cases Open:
SFDC Cases Counter:

Hello Michael,

As suggested, I'm opening this one.

Our customer is not getting the node rebooted when they shut down the network on a node with NHC and SNR in place.

Basically, these are the steps:

1) They shut down the interface on the node.
2) Remediation seems to take place: the node is drained but NOT rebooted.
3) Once connectivity is restored, the node is rebooted.

What you found while troubleshooting and analysing the SNR logs is that the node is somehow still communicating with peers, so it is not marked as isolated and is not rebooted.

The customer tried completely disabling networking on the node, and the behaviour is the same.

I'm attaching the latest SNR agent log to this Jira.

In that log, the interface shutdown was performed on
Wed Apr 23 08:32:15 UTC 2025
The customer noticed these lines:

INFO api-check getting health status from peer {"IP": ""}
INFO api-check.peerhealth client new peer client {"serveraddr": ":30001"}
...
INFO api-check got response from peer {"IP": "", "status": 3}

and he's asking whether this communication attempt could be the reason the node is not marked unhealthy.
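
As a hedged illustration of why an empty IP matters here: the `serveraddr` value `":30001"` in the log is what a host/port join produces when the host is empty, and Go's `net.Dial` documentation states that an empty host means the local system. The `peerAddress` helper below is hypothetical and only assumes the address is built this way; the sample Pod IP is invented.

```go
package main

import (
	"fmt"
	"net"
	"strconv"
)

// peerAddress is a hypothetical helper that builds the "host:port" string
// used to reach a peer. It is not taken from the self-node-remediation code.
func peerAddress(podIP string, port int) string {
	return net.JoinHostPort(podIP, strconv.Itoa(port))
}

func main() {
	// A peer with a Pod IP yields a normal remote address.
	fmt.Println(peerAddress("10.128.2.15", 30001)) // 10.128.2.15:30001

	// An empty Pod IP yields ":30001". When dialed, Go treats an empty
	// host as the local system, so the request never leaves the node.
	fmt.Printf("%q\n", peerAddress("", 30001)) // ":30001"
}
```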

Let me know if you need anything else from the customer end.

We have already shared a considerable amount of data with you.

links to

medik8s/self-node-remediation#259: Fix: Handle missing PodIP in Node assignment

RHEA-2025:145522 Self Node Remediation 0.10.1

mentioned on

Merge request - Updated US source to: c3d368e Merge pull request #260 from openshift-cherrypick-robot/cherry-pick-259-to-release-0.10

Assignee: Carlo Lobrano

Reporter: Luca Davidde

Votes: 0

Watchers: 14

Created: 2025/04/29 4:01 PM

Updated: 2025/07/10 10:05 AM

Resolved: 2025/05/15 7:38 AM
