OpenShift Bugs / OCPBUGS-3016

Mass Metal Failures Blocking Payloads: Pods stuck initializing + ErrImagePull


    • Type: Bug
    • Resolution: Done
    • Priority: Critical
    • Affects Version: 4.12
    • Component: Test Framework

      Noticed the mass failures in: https://amd64.ocp.releases.ci.openshift.org/releasestream/4.12.0-0.nightly/release/4.12.0-0.nightly-2022-10-31-232349

      So far it's just one payload, so this may resolve on its own, but for now I'm treating it as a blocker since it looks very broken.

      From the above link these two jobs are good candidates for investigation:

      https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.12-e2e-metal-ipi-ovn-ipv6/1587351295310696448

      https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.12-e2e-metal-ipi-sdn-bm/1587335377918627840

      Using the spyglass charts in both of the above links, you'll see lots of pods stuck in Pending, as well as alerts related to PodDisruptionLimit and PodStartupStorageOperationsFailing.
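
      For anyone trying to reproduce the same view from a live cluster rather than the spyglass charts, a rough sketch (not taken from the CI artifacts; namespaces and pod names would be whatever the e2e suite created):

      # Pods stuck in Pending across all namespaces
      oc get pods -A --field-selector=status.phase=Pending -o wide

      # Image pull failures also surface as events
      oc get events -A | grep -i -e ErrImagePull -e 'manifest unknown'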

      Tracing one of the pods stuck in Pending down to the node it's on (visible in spyglass), and then into that host's systemd journal, we see something like the excerpts below (a command sketch for reproducing that tracing follows them):

      Nov 01 09:22:44.912411 worker-0.ostest.test.metalkube.org kubenswrapper[3387]: E1101 09:22:44.912401    3387 kuberuntime_manager.go:862] container &Container{Name:donothing,Image:virthost.ostest.test.metalkube.org:5000/localimages/local-test-image:e2e-28-registry-k8s-io-pause-3-8-aP7uYsw5XCmoDy5W,Command:[],Args:[],WorkingDir:,Ports:[]ContainerPort{},Env:[]EnvVar{},Resources:ResourceRequirements{Limits:ResourceList{},Requests:ResourceList{},},VolumeMounts:[]VolumeMount{VolumeMount{Name:kube-api-access-6fr4z,ReadOnly:true,MountPath:/var/run/secrets/kubernetes.io/serviceaccount,SubPath:,MountPropagation:nil,SubPathExpr:,},},LivenessProbe:nil,ReadinessProbe:nil,Lifecycle:nil,TerminationMessagePath:/dev/termination-log,ImagePullPolicy:IfNotPresent,SecurityContext:nil,Stdin:false,StdinOnce:false,TTY:false,EnvFrom:[]EnvFromSource{},TerminationMessagePolicy:File,VolumeDevices:[]VolumeDevice{},StartupProbe:nil,} start failed in pod pod-0_e2e-disruption-7215(a28ffad0-d526-4add-9a6b-7b5a2780b6a0): ErrImagePull: rpc error: code = Unknown desc = reading manifest e2e-28-registry-k8s-io-pause-3-8-aP7uYsw5XCmoDy5W in virthost.ostest.test.metalkube.org:5000/localimages/local-test-image: manifest unknown: manifest unknown
      

      Or:

      Nov 01 08:13:53.817430 host5.cluster9.ocpci.eng.rdu2.redhat.com kubenswrapper[2385]: E1101 08:13:53.817387    2385 kuberuntime_manager.go:862] container &Container{Name:donothing,Image:host1.cluster9.ocpci.eng.rdu2.redhat.com:5000/localimages/local-test-image:e2e-28-registry-k8s-io-pause-3-8-aP7uYsw5XCmoDy5W,Command:[],Args:[],WorkingDir:,Ports:[]ContainerPort{},Env:[]EnvVar{},Resources:ResourceRequirements{Limits:ResourceList{},Requests:ResourceList{},},VolumeMounts:[]VolumeMount{VolumeMount{Name:kube-api-access-bx6jd,ReadOnly:true,MountPath:/var/run/secrets/kubernetes.io/serviceaccount,SubPath:,MountPropagation:nil,SubPathExpr:,},},LivenessProbe:nil,ReadinessProbe:nil,Lifecycle:nil,TerminationMessagePath:/dev/termination-log,ImagePullPolicy:IfNotPresent,SecurityContext:nil,Stdin:false,StdinOnce:false,TTY:false,EnvFrom:[]EnvFromSource{},TerminationMessagePolicy:File,VolumeDevices:[]VolumeDevice{},StartupProbe:nil,} start failed in pod rs-bx6tp_e2e-disruption-1729(e46c8578-3e2a-40b3-b3dd-4b23b696e7e0): ErrImagePull: rpc error: code = Unknown desc = reading manifest e2e-28-registry-k8s-io-pause-3-8-aP7uYsw5XCmoDy5W in host1.cluster9.ocpci.eng.rdu2.redhat.com:5000/localimages/local-test-image: manifest unknown: manifest unknown
      
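      A rough sketch of that manual tracing, assuming live cluster (or SSH) access; <pod>, <namespace> and <node-name> are placeholders, not values from these runs:

      # 1. Find the node the Pending pod was scheduled to
      oc get pod <pod> -n <namespace> -o wide

      # 2. Pull the kubelet journal from that node and filter for the pull error
      oc adm node-logs <node-name> -u kubelet | grep -e ErrImagePull -e 'manifest unknown'

      # Or, with SSH access to the host:
      journalctl -u kubelet | grep -e ErrImagePull -e 'manifest unknown'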

      This ErrImagePull with manifest unknown looks to be the root of the problem to me.
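
      To confirm the missing manifest, one hedged follow-up would be to ask the mirror registry directly, using the image reference from the first journal excerpt above (TLS/credential flags depend on how the local registry is configured):

      # Does the tag from the error actually exist in the mirror?
      skopeo inspect --tls-verify=false \
        docker://virthost.ostest.test.metalkube.org:5000/localimages/local-test-image:e2e-28-registry-k8s-io-pause-3-8-aP7uYsw5XCmoDy5W

      # What tags does the repository have?
      skopeo list-tags --tls-verify=false \
        docker://virthost.ostest.test.metalkube.org:5000/localimages/local-test-image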

      Affecting at least:
      job=periodic-ci-openshift-release-master-nightly-4.12-e2e-metal-ipi-ovn-ipv6
      job=periodic-ci-openshift-release-master-nightly-4.12-e2e-metal-ipi-sdn-bm

      Looks to have started yesterday around 7pm EST:
      https://testgrid.k8s.io/redhat-openshift-ocp-release-4.12-blocking#periodic-ci-openshift-release-master-nightly-4.12-e2e-metal-ipi-ovn-ipv6

      But it also happened in a batch a couple of days ago, Oct 27-28.

              Assignee: Devan Goodwin (rhn-engineering-dgoodwin)
              Reporter: Devan Goodwin (rhn-engineering-dgoodwin)
              Votes: 0
              Watchers: 5
