-
Bug
-
Resolution: Done
-
Critical
-
None
-
4.16.z
-
Quality / Stability / Reliability
-
False
-
-
3
-
CORENET Sprint 273, CORENET Sprint 276
-
2
Description of problem:
During SDN -> OVN migration, the OpenShift API server (OAS) is unstable: it flaps between healthy and unhealthy states. During a recent incident, the kubelet was failing to pull images with a 500 Internal Server Error, so the unhealthy OpenShift API server appears to have been causing the image pull failures. After we observed the OAS complaining that it could not reach etcd, we replaced the master node that hosted that OAS instance. This had a significant positive impact on the cluster, and the migration progressed further.
Version-Release number of selected component (if applicable):
4.16.41
How reproducible:
Not clear at this point, but we think this could happen during any SDN -> OVN migration.
Steps to Reproduce:
We observed this during an incident. We are not sure exactly how to reproduce this again.
Actual results:
We observed logs pointing to failing image pulls:
I0710 22:54:46.275657 1 generator.go:783] Error resolving ImageStreamTag <redacted-tag>:latest in namespace <redacted-namespace>: unable to find latest tagged image
I0710 22:55:06.790381 1 generator.go:783] Error resolving ImageStreamTag <redacted-tag>:latest in namespace <redacted-namespace>: unable to find latest tagged image
I0710 23:01:12.498164 1 generator.go:783] Error resolving ImageStreamTag <redacted-tag>:latest in namespace <redacted-namespace>: unable to find latest tagged image
I0710 23:01:12.528673 1 generator.go:783] Error resolving ImageStreamTag <redacted-tag>:latest in namespace <redacted-namespace>: unable to find latest tagged image
W0710 23:01:49.971997 1 logging.go:59] [core] [Channel #6774 SubChannel #6777] grpc: addrConn.createTransport failed to connect to {Addr: "10.*.*.*:*", ServerName: "10.*.*.*:*", }. Err: connection error: desc = "error reading server preface: read tcp 10.*.*.*:*->10.*.*.*:*: use of closed network connection"
Trace[2093165510]: ---"Write to database call failed" len:126,err:Internal error occurred: admission plugin "build.openshift.io/BuildByStrategy" failed to complete validation in 13s 13000ms (23:03:13.784)
I0710 23:06:00.659934 1 generator.go:783] Error resolving ImageStreamTag <redacted-tag>:latest in namespace <redacted-namespace>: unable to find latest tagged image
Trace[230092673]: ---"Write to database call failed" len:126,err:Internal error occurred: admission plugin "build.openshift.io/BuildByStrategy" failed to complete validation in 13s 13001ms (23:12:55.541)
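For triage, entries like the ones above can be tallied by severity and source location to see which message dominates and when the gRPC warnings start. The following is only a minimal sketch: it assumes the standard klog header layout (severity letter, MMDD, time, thread id, file:line), and the names `KLOG_HEADER`, `tally_klog`, and `sample_log` are hypothetical helpers, not part of any OpenShift tooling. The sample messages are shortened copies of the entries in this report.

```python
import re
from collections import Counter

# klog header: <severity><MMDD> <hh:mm:ss.uuuuuu> <thread id> <file>:<line>]
# The lookbehind requires the header to start at a word boundary, so the
# pattern also works when several entries were pasted onto one line.
KLOG_HEADER = re.compile(
    r"(?<!\S)(?P<sev>[IWEF])(?P<mmdd>\d{4}) "
    r"(?P<time>\d{2}:\d{2}:\d{2}\.\d{6})\s+"
    r"(?P<tid>\d+) (?P<src>[\w.-]+:\d+)\]"
)


def tally_klog(text: str) -> Counter:
    """Count klog entries per (severity, source file:line)."""
    return Counter((m["sev"], m["src"]) for m in KLOG_HEADER.finditer(text))


# Shortened copies of entries from this report (assumption: message text
# after the header is irrelevant for the tally).
sample_log = (
    "I0710 22:54:46.275657 1 generator.go:783] Error resolving ImageStreamTag\n"
    "I0710 22:55:06.790381 1 generator.go:783] Error resolving ImageStreamTag\n"
    "W0710 23:01:49.971997 1 logging.go:59] [core] grpc: addrConn.createTransport failed\n"
)

if __name__ == "__main__":
    for (severity, source), count in tally_klog(sample_log).most_common():
        print(severity, source, count)
```

Lines without a klog header, such as the `Trace[...]` entries, are deliberately ignored by this sketch and would need a separate pattern.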
From one of the image-registry pods:
time="2025-07-10T23:12:21.475868443Z" level=error msg="response completed with error" err.code=unknown err.detail="ImageStream:Unknown: Exists: failed to get image stream openshift/tools: ImageStreamGetter:Unknown: openshift/tools: the server is currently unable to handle the request (get imagestreams.image.openshift.io tools)" err.message="unknown error" go.version="go1.21.13 (Red Hat 1.21.13-8.el9_4) X:strictfipsruntime" http.request.host="image-registry.openshift-image-registry.svc:5000" http.request.id=429b4030-3431-4926-8ed8-45e2d42d3e95 http.request.method=GET http.request.remoteaddr="10.*.*.*:*" http.request.uri=/v2/openshift/tools/manifests/latest http.request.useragent="cri-o/1.29.13 os/linux arch/amd64" http.response.contenttype=application/json http.response.duration=30.007318189s http.response.status=500 http.response.written=70 openshift.auth.user="system:serviceaccount:<redacted-ns>:default" openshift.auth.userid=265e7586-4998-4825-94f1-fe7cf336ad71 vars.name=openshift/tools vars.reference=latest
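The registry entry above is in a key=value (logfmt-style) layout, so it can be split into fields to separate registry-side failures from failures caused by the OpenShift API server being unavailable. This is a rough sketch under assumptions: `parse_logfmt` and `is_api_server_failure` are hypothetical helper names, the matching rule (status 500 plus the "unable to handle the request" text) is a heuristic drawn from this one entry, and the sample is an abbreviated copy of it.

```python
import shlex


def parse_logfmt(line: str) -> dict[str, str]:
    """Split a key=value log line into a dict, honoring double-quoted values."""
    fields = {}
    for token in shlex.split(line):
        key, sep, value = token.partition("=")
        if sep:  # skip tokens that are not key=value pairs
            fields[key] = value
    return fields


def is_api_server_failure(fields: dict[str, str]) -> bool:
    """Heuristic: a 500 whose detail says the API server could not serve the request."""
    return (
        fields.get("http.response.status") == "500"
        and "unable to handle the request" in fields.get("err.detail", "")
    )


# Abbreviated copy of the registry entry from this report.
sample_entry = (
    'time="2025-07-10T23:12:21.475868443Z" level=error '
    'msg="response completed with error" '
    'err.detail="ImageStreamGetter:Unknown: openshift/tools: the server is '
    'currently unable to handle the request (get imagestreams.image.openshift.io tools)" '
    "http.request.uri=/v2/openshift/tools/manifests/latest "
    "http.response.duration=30.007318189s http.response.status=500"
)

if __name__ == "__main__":
    parsed = parse_logfmt(sample_entry)
    print(parsed["http.request.uri"], parsed["http.response.duration"])
    print("API server failure:", is_api_server_failure(parsed))
```

`shlex.split` is used here only because the quoting in this entry is shell-compatible; values containing unbalanced quotes would need a dedicated logfmt parser.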
There were also logs about failing to reach an etcd IP endpoint:
E0710 23:58:27.549014 1 strategy.go:60] unable to parse manifest for "sha256:<redacted-hash>": unexpected end of JSON input
W0710 23:58:55.233550 1 logging.go:59] [core] [Channel #1759 SubChannel #1761] grpc: addrConn.createTransport failed to connect to {Addr: "10.*.*.*:*", ServerName: "10.*.*.*:*", }. Err: connection error: desc = "error reading server preface: read tcp 10.*.*.*:*->10.*.*.*:*: use of closed network connection"
I0710 23:59:07.115460 1 generator.go:783] Error resolving ImageStreamTag <redacted-tag>:latest in namespace <redacted-namespace>: unable to find latest tagged image
Expected results:
The OpenShift API server should remain stable throughout the migration process.
Additional info:
Affected version: 4.16.41
duplicates: OCPBUGS-55282 OpenShift CNI Live Migration blocked due to Pod Disruption Budget and network bridge failure between SDN and OVN (Closed)