- Bug
- Resolution: Unresolved
- Undefined
- None
- 4.14.z, 4.16.z
- Quality / Stability / Reliability
- False
Webhook pods in the openshift-nmstate, openshift-user-workload-monitoring, and openshift-monitoring namespaces are intermittently logging TLS handshake errors and "connection reset by peer" messages:
$ oc logs nmstate-webhook-7b5f85f4c6-5nhbj | grep TLS | tail -n6
2025-05-19T09:44:51.455849569Z 2025/05/19 09:44:51 http: TLS handshake error from 10.128.4.2:47710: read tcp 10.130.1.84:9443->10.128.4.2:47710: read: connection reset by peer
2025-05-19T09:44:51.484873929Z 2025/05/19 09:44:51 http: TLS handshake error from 10.128.4.2:47718: EOF
2025-05-19T09:44:51.544165122Z 2025/05/19 09:44:51 http: TLS handshake error from 10.128.4.2:47732: EOF
2025-05-19T09:44:51.581382637Z 2025/05/19 09:44:51 http: TLS handshake error from 10.128.4.2:47744: EOF
2025-05-19T09:44:51.594788646Z 2025/05/19 09:44:51 http: TLS handshake error from 10.128.4.2:47756: EOF
2025-05-19T09:44:51.609079099Z 2025/05/19 09:44:51 http: TLS handshake error from 10.128.4.2:47758: EOF

$ oc logs prometheus-operator-admission-webhook-86f48d5c4c-t7fxp | tail -n3
2025-05-19T09:05:04.745831550Z ts=2025-05-19T09:05:04.745704622Z caller=stdlib.go:105 caller=server.go:3215 msg="http: TLS handshake error from 10.128.4.2:55384: EOF"
2025-05-19T09:10:02.066883445Z ts=2025-05-19T09:10:02.066811907Z caller=stdlib.go:105 caller=server.go:3215 msg="http: TLS handshake error from 10.128.4.2:56442: read tcp 10.129.2.10:8443->10.128.4.2:56442: read: connection reset by peer"
2025-05-19T09:20:06.242602992Z ts=2025-05-19T09:20:06.242528325Z caller=stdlib.go:105 caller=server.go:3215 msg="http: TLS handshake error from 10.128.4.2:43294: EOF"
Version-Release number of selected component:
The issue was initially observed on 4.14.z; the customer has since upgraded the cluster to 4.16.40 and still sees the same behavior.
OpenShift 4.16.40
How reproducible:
Using curl to reach the endpoints shows successful connections, but kubelet health check probes to the webhook pods are intermittently failing.
The error was reproducible using a custom Python script in a test cluster:
import socket
import struct
import time

# Replace with the target TLS server IP and port
server_ip = "10.129.1.149"
server_port = 9443

# Create TCP socket
s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)

# Set SO_LINGER to send RST on close
# linger active (1), timeout 0 => send RST instead of FIN
s.setsockopt(socket.SOL_SOCKET, socket.SO_LINGER, struct.pack('ii', 1, 0))

# Connect to the server
s.connect((server_ip, server_port))

# Wait a little to ensure server starts TLS handshake
time.sleep(0.1)

# Close socket: this triggers TCP RST
s.close()
Running the script produces the errors seen above; they are reported because the health-check probes do not wait for a proper connection closure. The suspicion falls on kubelet, because kubelet is what sends the health-check probes to the application pods.
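For contrast, here is a minimal sketch (assuming the same test endpoint 10.129.1.149:9443 used by the reproducer, and skipping certificate verification) that completes the TLS handshake and then closes the socket normally; the webhook does not log a handshake error for such a connection, because the handshake finishes and the close sends a FIN rather than an RST:

import socket
import ssl

# Contrast check (illustrative): complete the TLS handshake, then close normally.
server_ip = "10.129.1.149"   # same test endpoint as the RST reproducer (assumption)
server_port = 9443

ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE   # the webhook serves its own certificate

with socket.create_connection((server_ip, server_port)) as raw:
    with ctx.wrap_socket(raw, server_hostname=server_ip) as tls:
        # The handshake has completed at this point, so no "TLS handshake error"
        # is logged; the sockets are closed with a normal FIN, not an RST.
        print("negotiated", tls.version())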
On the affected customer's cluster, we can see below an nmstate webhook pod reporting the "connection reset by peer" error at around the same time the readiness probes are sent:
2025/06/10 08:09:36 http: TLS handshake error from 10.130.4.2:48970: read tcp 10.130.4.40:9443->10.130.4.2:48970: read: connection reset by peer

Jun 10 08:09:36.417435 master01.preprod.openshift.hh.atg.se kubenswrapper[833363]: I0610 08:09:36.417397 833363 prober.go:154] "HTTP-Probe" scheme="https" host="10.130.4.40" port="9443" path="/readyz" timeout="1s" headers=[{"name":"Content-Type","value":"application/json"}]
Jun 10 08:09:36.426810 master01.preprod.openshift.hh.atg.se kubenswrapper[833363]: I0610 08:09:36.426720 833363 http.go:117] Probe succeeded for https://10.130.4.40:9443/readyz, Response: {200 OK 200 HTTP/1.1 1 1 map[Content-Length:[2] Content-Type:[text/plain; charset=utf-8] Date:[Tue, 10 Jun 2025 08:09:36 GMT]] 0xc007884440 2 [] true false map[] 0xc003266800 0xc007f81e40}
Jun 10 08:09:36.427023 master01.preprod.openshift.hh.atg.se kubenswrapper[833363]: I0610 08:09:36.426836 833363 prober.go:116] "Probe succeeded" probeType="Readiness" pod="openshift-nmstate/nmstate-webhook-565669744-8wd4g" podUID="cbb633b0-eb29-42fa-b353-420b17650b85" containerName="nmstate-webhook"
The kubelet logs above show the probe succeeding (200 OK). In the nmstate-webhook deployment, the readiness probe's timeoutSeconds is set to 1s, meaning the HTTPS request to the pod is closed within 1s regardless of any response. Under a resource crunch on the node, the HTTP probe may take longer than that; in such a case a proper TCP closure is not sent to the application pod, which contributes to the errors seen in the nmstate-webhook pod.
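To illustrate that mechanism, here is a rough sketch of a probe-like request (not the kubelet code; it assumes the pod IP 10.130.4.40:9443 from the logs above and skips certificate verification) that fetches /readyz with a 1-second deadline. When the deadline fires, the connection is simply dropped without a clean TLS shutdown, which the webhook logs as EOF or "connection reset by peer":

import ssl
import urllib.error
import urllib.request

# Probe-like request (illustrative): GET /readyz over HTTPS with a 1s timeout,
# mirroring the readiness probe's timeoutSeconds: 1.
ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE   # the webhook uses its own serving certificate

url = "https://10.130.4.40:9443/readyz"   # pod IP taken from the logs above (assumption)
try:
    with urllib.request.urlopen(url, timeout=1.0, context=ctx) as resp:
        print(resp.status, resp.read().decode())
except (TimeoutError, urllib.error.URLError) as exc:
    # On a slow node the 1s deadline can expire mid-handshake or mid-response;
    # the connection is then abandoned without a proper TLS/TCP shutdown.
    print("probe gave up:", exc)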
As a workaround, we changed timeoutSeconds to 3s for the nmstate-webhook deployment:
oc -n openshift-nmstate patch deployment/nmstate-webhook --type='json' -p='[{"op": "add", "path": "/spec/template/spec/containers/0/readinessProbe/timeoutSeconds", "value": 3}]'
But after a few minutes, the timeout setting is overwritten.
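To confirm when the patched value gets reverted, the following is a small sketch (assuming the kubernetes Python client and a kubeconfig with access to the cluster; any equivalent watch on the deployment would do) that polls the deployment and prints the readiness-probe timeout whenever it changes:

import time
from kubernetes import client, config

# Poll the nmstate-webhook deployment and report whenever the readiness-probe
# timeout changes, e.g. when the patched value is overwritten again.
config.load_kube_config()
apps = client.AppsV1Api()

last = None
while True:
    dep = apps.read_namespaced_deployment("nmstate-webhook", "openshift-nmstate")
    probe = dep.spec.template.spec.containers[0].readiness_probe
    current = probe.timeout_seconds if probe else None
    if current != last:
        print(time.strftime("%H:%M:%S"), "readinessProbe.timeoutSeconds =", current)
        last = current
    time.sleep(10)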
- clones: OCPBUGS-58209 Webhook pods from "openshift-nmstate" and "openshift-user-workload-monitoring" throwing TLS handshake errors alongside "connection reset by peer" messages (Closed)
- duplicates: OCPBUGS-5916 The kube-rbac-proxy-federate container reporting TLS handshake error (Closed)