OCPBUGS-59222

OpenShift API Server unstable during SDN -> OVN migration


    • Quality / Stability / Reliability
    • 3
    • CORENET Sprint 273, CORENET Sprint 276
    • 2

      Description of problem:

      During SDN -> OVN migration, the OpenShift API Server (OAS) is unstable: it flaps between healthy and unhealthy states. During a recent incident, the kubelet failed to pull images with a 500 Internal Server Error. The unhealthy OpenShift API Server appears to have caused the image pull failures.
      
      After we observed the OAS complaining that it could not reach etcd, we replaced the master node hosting that OAS instance. This had a significant positive impact on the cluster, and the migration progressed further.
          

      Version-Release number of selected component (if applicable):

      4.16.41
          

      How reproducible:

      Not clear at this point, but we think this can happen during SDN -> OVN migrations.
          

      Steps to Reproduce:

      We observed this during an incident and are not sure how to reproduce it.
          

      Actual results:
      We observed logs pointing to failing image pulls:

      I0710 22:54:46.275657       1 generator.go:783] Error resolving ImageStreamTag <redacted-tag>:latest in namespace <redacted-namespace>: unable to find latest tagged image
      I0710 22:55:06.790381       1 generator.go:783] Error resolving ImageStreamTag <redacted-tag>:latest in namespace <redacted-namespace>: unable to find latest tagged image
      I0710 23:01:12.498164       1 generator.go:783] Error resolving ImageStreamTag <redacted-tag>:latest in namespace <redacted-namespace>: unable to find latest tagged image
      I0710 23:01:12.528673       1 generator.go:783] Error resolving ImageStreamTag <redacted-tag>:latest in namespace <redacted-namespace>: unable to find latest tagged image
      W0710 23:01:49.971997       1 logging.go:59] [core] [Channel #6774 SubChannel #6777] grpc: addrConn.createTransport failed to connect to {Addr: "10.*.*.*:*", ServerName: "10.*.*.*:*", }. Err: connection error: desc = "error reading server preface: read tcp 10.*.*.*:*->10.*.*.*:*: use of closed network connection"
      Trace[2093165510]: ---"Write to database call failed" len:126,err:Internal error occurred: admission plugin "build.openshift.io/BuildByStrategy" failed to complete validation in 13s 13000ms (23:03:13.784)
      I0710 23:06:00.659934       1 generator.go:783] Error resolving ImageStreamTag <redacted-tag>:latest in namespace <redacted-namespace>: unable to find latest tagged image
      Trace[230092673]: ---"Write to database call failed" len:126,err:Internal error occurred: admission plugin "build.openshift.io/BuildByStrategy" failed to complete validation in 13s 13001ms (23:12:55.541)
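
To see which messages dominate an excerpt like the one above, the lines can be tallied by severity and source location. The following is a minimal sketch, assuming the lines follow the usual klog header layout (severity letter, `mmdd`, time, thread id, `file:line]`). The names `KLOG_RE` and `summarize` are hypothetical helpers, not part of any OpenShift tooling, and `Trace[...]` lines are skipped because they carry no such header.

```python
import re
from collections import Counter

# Assumed klog header: Lmmdd hh:mm:ss.uuuuuu threadid file:line] message
KLOG_RE = re.compile(
    r"^(?P<sev>[IWEF])(?P<month>\d{2})(?P<day>\d{2}) "
    r"(?P<time>\d{2}:\d{2}:\d{2})\.\d{6}\s+\d+ "
    r"(?P<src>[^\s:]+:\d+)\] (?P<msg>.*)$"
)


def summarize(lines):
    """Count log lines per (severity, source file:line) pair."""
    counts = Counter()
    for line in lines:
        m = KLOG_RE.match(line.strip())
        if m is None:
            continue  # e.g. Trace[...] lines have no klog header
        counts[(m["sev"], m["src"])] += 1
    return counts


# Sample lines copied from the excerpt in this report.
SAMPLE = [
    'I0710 22:54:46.275657       1 generator.go:783] Error resolving ImageStreamTag <redacted-tag>:latest in namespace <redacted-namespace>: unable to find latest tagged image',
    'I0710 22:55:06.790381       1 generator.go:783] Error resolving ImageStreamTag <redacted-tag>:latest in namespace <redacted-namespace>: unable to find latest tagged image',
    'W0710 23:01:49.971997       1 logging.go:59] [core] [Channel #6774 SubChannel #6777] grpc: addrConn.createTransport failed to connect to {Addr: "10.*.*.*:*", ServerName: "10.*.*.*:*", }. Err: connection error: desc = "error reading server preface: read tcp 10.*.*.*:*->10.*.*.*:*: use of closed network connection"',
    'Trace[2093165510]: ---"Write to database call failed" len:126,err:Internal error occurred: admission plugin "build.openshift.io/BuildByStrategy" failed to complete validation in 13s 13000ms (23:03:13.784)',
]

if __name__ == "__main__":
    for (sev, src), n in summarize(SAMPLE).most_common():
        print(sev, src, n)
```

Run against a full log, a tally like this makes it easier to tell whether the ImageStreamTag resolution messages or the gRPC connection warnings are the more frequent symptom.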
      

      From one of the image-registry pods:

      time="2025-07-10T23:12:21.475868443Z" level=error msg="response completed with error" err.code=unknown err.detail="ImageStream:Unknown: Exists: failed to get image stream openshift/tools: ImageStreamGetter:Unknown: openshift/tools: the server is currently unable to handle the request (get imagestreams.image.openshift.io tools)" err.message="unknown error" go.version="go1.21.13 (Red Hat 1.21.13-8.el9_4) X:strictfipsruntime" http.request.host="image-registry.openshift-image-registry.svc:5000" http.request.id=429b4030-3431-4926-8ed8-45e2d42d3e95 http.request.method=GET http.request.remoteaddr="10.*.*.*:*" http.request.uri=/v2/openshift/tools/manifests/latest http.request.useragent="cri-o/1.29.13 os/linux arch/amd64" http.response.contenttype=application/json http.response.duration=30.007318189s http.response.status=500 http.response.written=70 openshift.auth.user="system:serviceaccount:<redacted-ns>:default" openshift.auth.userid=265e7586-4998-4825-94f1-fe7cf336ad71 vars.name=openshift/tools vars.reference=latest
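
The registry entry above is a `key=value` line in which the request ran for about 30 seconds before returning a 500. The following is a minimal sketch for pulling such requests out of registry logs, assuming values are either bare tokens or double-quoted strings without escaped quotes. `parse_logfmt` and `is_slow_server_error` are hypothetical helper names, the 30-second threshold is an illustrative assumption, and the sample is a subset of the fields from the entry in this report.

```python
import re
import shlex

# Matches plain second-valued durations such as "30.007318189s".
SECONDS_RE = re.compile(r"^(\d+(?:\.\d+)?)s$")


def parse_logfmt(line):
    """Split a key=value log line into a dict, honoring double quotes."""
    fields = {}
    for token in shlex.split(line):
        key, sep, value = token.partition("=")
        if sep:
            fields[key] = value
    return fields


def is_slow_server_error(fields, min_seconds=30.0):
    """True for 5xx responses that took at least min_seconds (assumed threshold)."""
    status = int(fields.get("http.response.status", "0"))
    m = SECONDS_RE.match(fields.get("http.response.duration", ""))
    seconds = float(m.group(1)) if m else 0.0  # other duration units are ignored
    return status >= 500 and seconds >= min_seconds


# Subset of the fields from the registry log entry in this report.
SAMPLE = (
    'time="2025-07-10T23:12:21.475868443Z" level=error '
    'msg="response completed with error" err.code=unknown '
    'http.request.method=GET '
    'http.request.uri=/v2/openshift/tools/manifests/latest '
    'http.request.useragent="cri-o/1.29.13 os/linux arch/amd64" '
    'http.response.duration=30.007318189s http.response.status=500'
)

if __name__ == "__main__":
    entry = parse_logfmt(SAMPLE)
    print(entry["http.request.uri"], is_slow_server_error(entry))
```

Filtering on both status and duration separates requests that failed after a long wait on the API server from requests that were rejected immediately.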
      

      There were also logs showing failures to reach an etcd IP endpoint:

      E0710 23:58:27.549014       1 strategy.go:60] unable to parse manifest for "sha256:<redacted-hash>": unexpected end of JSON input
      W0710 23:58:55.233550       1 logging.go:59] [core] [Channel #1759 SubChannel #1761] grpc: addrConn.createTransport failed to connect to {Addr: "10.*.*.*:*", ServerName: "10.*.*.*:*", }. Err: connection error: desc = "error reading server preface: read tcp 10.*.*.*:*->10.*.*.*:*: use of closed network connection"
      I0710 23:59:07.115460       1 generator.go:783] Error resolving ImageStreamTag <redacted-tag>:latest in namespace <redacted-namespace>: unable to find latest tagged image
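
When the addresses are not redacted, counting the gRPC connection failures per target address can show whether a single etcd endpoint accounts for them. The following is a minimal sketch, assuming the `addrConn.createTransport failed to connect to {Addr: "..."` wording seen in the warnings above. `failed_endpoints` is a hypothetical helper, and the sample lines are copied from this report with their redactions intact.

```python
import re
from collections import Counter

# Captures the Addr value from gRPC connection-failure warnings.
CONNECT_FAIL_RE = re.compile(
    r'addrConn\.createTransport failed to connect to \{Addr: "(?P<addr>[^"]*)"'
)


def failed_endpoints(lines):
    """Count connection failures per target address."""
    counts = Counter()
    for line in lines:
        m = CONNECT_FAIL_RE.search(line)
        if m:
            counts[m["addr"]] += 1
    return counts


# Lines copied from the excerpt in this report (addresses are redacted there).
SAMPLE = [
    'E0710 23:58:27.549014       1 strategy.go:60] unable to parse manifest for "sha256:<redacted-hash>": unexpected end of JSON input',
    'W0710 23:58:55.233550       1 logging.go:59] [core] [Channel #1759 SubChannel #1761] grpc: addrConn.createTransport failed to connect to {Addr: "10.*.*.*:*", ServerName: "10.*.*.*:*", }. Err: connection error: desc = "error reading server preface: read tcp 10.*.*.*:*->10.*.*.*:*: use of closed network connection"',
    'I0710 23:59:07.115460       1 generator.go:783] Error resolving ImageStreamTag <redacted-tag>:latest in namespace <redacted-namespace>: unable to find latest tagged image',
]

if __name__ == "__main__":
    for addr, n in failed_endpoints(SAMPLE).most_common():
        print(addr, n)
```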
      

      Expected results:

      The OpenShift API server should remain stable during the migration process.
          

      Additional info:

      affected version: 4.16.41
          

              jluhrsen Jamo Luhrsen
              taislam.osd Tafhim Ul Islam
              Zhanqi Zhao Zhanqi Zhao