- Bug
- Resolution: Unresolved
- Undefined
- None
- 4.14.z, 4.16.z
- Quality / Stability / Reliability
- False
Webhook pods in the openshift-nmstate, openshift-user-workload-monitoring, and openshift-monitoring namespaces are intermittently logging TLS handshake errors and "connection reset by peer" messages:
$ oc logs nmstate-webhook-7b5f85f4c6-5nhbj | grep TLS | tail -n6
2025-05-19T09:44:51.455849569Z 2025/05/19 09:44:51 http: TLS handshake error from 10.128.4.2:47710: read tcp 10.130.1.84:9443->10.128.4.2:47710: read: connection reset by peer
2025-05-19T09:44:51.484873929Z 2025/05/19 09:44:51 http: TLS handshake error from 10.128.4.2:47718: EOF
2025-05-19T09:44:51.544165122Z 2025/05/19 09:44:51 http: TLS handshake error from 10.128.4.2:47732: EOF
2025-05-19T09:44:51.581382637Z 2025/05/19 09:44:51 http: TLS handshake error from 10.128.4.2:47744: EOF
2025-05-19T09:44:51.594788646Z 2025/05/19 09:44:51 http: TLS handshake error from 10.128.4.2:47756: EOF
2025-05-19T09:44:51.609079099Z 2025/05/19 09:44:51 http: TLS handshake error from 10.128.4.2:47758: EOF

$ oc logs prometheus-operator-admission-webhook-86f48d5c4c-t7fxp | tail -n3
2025-05-19T09:05:04.745831550Z ts=2025-05-19T09:05:04.745704622Z caller=stdlib.go:105 caller=server.go:3215 msg="http: TLS handshake error from 10.128.4.2:55384: EOF"
2025-05-19T09:10:02.066883445Z ts=2025-05-19T09:10:02.066811907Z caller=stdlib.go:105 caller=server.go:3215 msg="http: TLS handshake error from 10.128.4.2:56442: read tcp 10.129.2.10:8443->10.128.4.2:56442: read: connection reset by peer"
2025-05-19T09:20:06.242602992Z ts=2025-05-19T09:20:06.242528325Z caller=stdlib.go:105 caller=server.go:3215 msg="http: TLS handshake error from 10.128.4.2:43294: EOF"
Version-Release number of selected component:
The issue was initially observed on 4.14.z; the customer has since upgraded the cluster to 4.16.40 and still sees the same behavior.
OpenShift 4.16.40
How reproducible:
Using curl to reach the endpoints shows successful connections, but kubelet health check probes to the webhook pods are intermittently failing.
The error was reproducible using a custom Python script in a test cluster:
import socket
import struct
import time

# Replace with the target TLS server IP and port
server_ip = "10.129.1.149"
server_port = 9443

# Create TCP socket
s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)

# Set SO_LINGER to send RST on close
# linger active (1), timeout 0 => send RST instead of FIN
s.setsockopt(socket.SOL_SOCKET, socket.SO_LINGER, struct.pack('ii', 1, 0))

# Connect to the server
s.connect((server_ip, server_port))

# Wait a little to ensure server starts TLS handshake
time.sleep(0.1)

# Close socket: this triggers TCP RST
s.close()
Running the script produces the errors seen above; they are reported because the health-check probes do not wait for a proper connection closure. The suspicion falls on kubelet, because kubelet is what sends the health-check probes to the application pods.
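For contrast, here is a minimal sketch (assuming the same test endpoint 10.129.1.149:9443 used by the reproducer, and skipping certificate verification) that completes the TLS handshake and then closes the socket normally; the webhook does not log a handshake error for such a connection, because the handshake finishes and the close sends a FIN rather than an RST:

import socket
import ssl

# Contrast check (illustrative): complete the TLS handshake, then close normally.
server_ip = "10.129.1.149"   # same test endpoint as the RST reproducer (assumption)
server_port = 9443

ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE   # the webhook serves its own certificate

with socket.create_connection((server_ip, server_port)) as raw:
    with ctx.wrap_socket(raw, server_hostname=server_ip) as tls:
        # The handshake has completed at this point, so no "TLS handshake error"
        # is logged; the sockets are closed with a normal FIN, not an RST.
        print("negotiated", tls.version())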
On the affected customer's cluster, we can see below an nmstate webhook pod reporting the "connection reset by peer" error at around the same time the readiness probes are sent:
2025/06/10 08:09:36 http: TLS handshake error from 10.130.4.2:48970: read tcp 10.130.4.40:9443->10.130.4.2:48970: read: connection reset by peer

Jun 10 08:09:36.417435 master01.preprod.openshift.hh.atg.se kubenswrapper[833363]: I0610 08:09:36.417397 833363 prober.go:154] "HTTP-Probe" scheme="https" host="10.130.4.40" port="9443" path="/readyz" timeout="1s" headers=[{"name":"Content-Type","value":"application/json"}]
Jun 10 08:09:36.426810 master01.preprod.openshift.hh.atg.se kubenswrapper[833363]: I0610 08:09:36.426720 833363 http.go:117] Probe succeeded for https://10.130.4.40:9443/readyz, Response: {200 OK 200 HTTP/1.1 1 1 map[Content-Length:[2] Content-Type:[text/plain; charset=utf-8] Date:[Tue, 10 Jun 2025 08:09:36 GMT]] 0xc007884440 2 [] true false map[] 0xc003266800 0xc007f81e40}
Jun 10 08:09:36.427023 master01.preprod.openshift.hh.atg.se kubenswrapper[833363]: I0610 08:09:36.426836 833363 prober.go:116] "Probe succeeded" probeType="Readiness" pod="openshift-nmstate/nmstate-webhook-565669744-8wd4g" podUID="cbb633b0-eb29-42fa-b353-420b17650b85" containerName="nmstate-webhook"
The kubelet logs above show the probe succeeding (200 OK). In the nmstate-webhook deployment, the readiness probe's timeoutSeconds is set to 1s, meaning the HTTPS request to the pod is closed within 1s regardless of any response. Under a resource crunch on the node, the HTTP probe may take longer than that; in such a case a proper TCP closure is not sent to the application pod, which contributes to the errors seen in the nmstate-webhook pod.
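To illustrate that mechanism, here is a rough sketch of a probe-like request (not the kubelet code; it assumes the pod IP 10.130.4.40:9443 from the logs above and skips certificate verification) that fetches /readyz with a 1-second deadline. When the deadline fires, the connection is simply dropped without a clean TLS shutdown, which the webhook logs as EOF or "connection reset by peer":

import ssl
import urllib.error
import urllib.request

# Probe-like request (illustrative): GET /readyz over HTTPS with a 1s timeout,
# mirroring the readiness probe's timeoutSeconds: 1.
ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE   # the webhook uses its own serving certificate

url = "https://10.130.4.40:9443/readyz"   # pod IP taken from the logs above (assumption)
try:
    with urllib.request.urlopen(url, timeout=1.0, context=ctx) as resp:
        print(resp.status, resp.read().decode())
except (TimeoutError, urllib.error.URLError) as exc:
    # On a slow node the 1s deadline can expire mid-handshake or mid-response;
    # the connection is then abandoned without a proper TLS/TCP shutdown.
    print("probe gave up:", exc)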
As a workaround, we changed timeoutSeconds to 3s for the nmstate-webhook deployment:
oc -n openshift-nmstate patch deployment/nmstate-webhook --type='json' -p='[{"op": "add", "path": "/spec/template/spec/containers/0/readinessProbe/timeoutSeconds", "value": 3}]'
But after a few minutes, the timeout setting is overwritten.
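To confirm when the patched value gets reverted, the following is a small sketch (assuming the kubernetes Python client and a kubeconfig with access to the cluster; any equivalent watch on the deployment would do) that polls the deployment and prints the readiness-probe timeout whenever it changes:

import time
from kubernetes import client, config

# Poll the nmstate-webhook deployment and report whenever the readiness-probe
# timeout changes, e.g. when the patched value is overwritten again.
config.load_kube_config()
apps = client.AppsV1Api()

last = None
while True:
    dep = apps.read_namespaced_deployment("nmstate-webhook", "openshift-nmstate")
    probe = dep.spec.template.spec.containers[0].readiness_probe
    current = probe.timeout_seconds if probe else None
    if current != last:
        print(time.strftime("%H:%M:%S"), "readinessProbe.timeoutSeconds =", current)
        last = current
    time.sleep(10)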
- clones: OCPBUGS-58209 Webhook pods from "openshift-nmstate" and "openshift-user-workload-monitoring" throwing TLS handshake errors alongside "connection reset by peer" messages (Closed)
- duplicates: OCPBUGS-5916 The kube-rbac-proxy-federate container reporting TLS handshake error (Closed)