- Bug
- Resolution: Done
- Critical
- None
- 4.12
- None
- False
Noticed the mass failures in: https://amd64.ocp.releases.ci.openshift.org/releasestream/4.12.0-0.nightly/release/4.12.0-0.nightly-2022-10-31-232349
So far it's just one payload, so this may resolve on its own, but for now treating it as a real problem as it looks very broken.
From the above link, these two jobs are good candidates for investigation:
Using the spyglass charts in both of the above links, you'll see lots of pods stuck in Pending, as well as alerts related to PodDisruptionLimit and PodStartupStorageOperationsFailing.
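As a cross-check outside spyglass, a minimal sketch along these lines (assuming kubeconfig access to the affected cluster; not part of the CI tooling) lists the Pending pods and the node each one is scheduled to:

# Minimal sketch: list pods stuck in Pending and the node each is scheduled to,
# using the official kubernetes Python client. Assumes a kubeconfig pointing at
# the affected cluster.
from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()

pending = v1.list_pod_for_all_namespaces(field_selector="status.phase=Pending")
for pod in pending.items:
    print(f"{pod.metadata.namespace}/{pod.metadata.name} -> node={pod.spec.node_name}")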
Tracing one of the pods stuck in Pending down to the node it's on (visible in spyglass), and then to that host's systemd journal, we see something like:
Nov 01 09:22:44.912411 worker-0.ostest.test.metalkube.org kubenswrapper[3387]: E1101 09:22:44.912401 3387 kuberuntime_manager.go:862] container &Container{Name:donothing,Image:virthost.ostest.test.metalkube.org:5000/localimages/local-test-image:e2e-28-registry-k8s-io-pause-3-8-aP7uYsw5XCmoDy5W,Command:[],Args:[],WorkingDir:,Ports:[]ContainerPort{},Env:[]EnvVar{},Resources:ResourceRequirements{Limits:ResourceList{},Requests:ResourceList{},},VolumeMounts:[]VolumeMount{VolumeMount{Name:kube-api-access-6fr4z,ReadOnly:true,MountPath:/var/run/secrets/kubernetes.io/serviceaccount,SubPath:,MountPropagation:nil,SubPathExpr:,},},LivenessProbe:nil,ReadinessProbe:nil,Lifecycle:nil,TerminationMessagePath:/dev/termination-log,ImagePullPolicy:IfNotPresent,SecurityContext:nil,Stdin:false,StdinOnce:false,TTY:false,EnvFrom:[]EnvFromSource{},TerminationMessagePolicy:File,VolumeDevices:[]VolumeDevice{},StartupProbe:nil,} start failed in pod pod-0_e2e-disruption-7215(a28ffad0-d526-4add-9a6b-7b5a2780b6a0): ErrImagePull: rpc error: code = Unknown desc = reading manifest e2e-28-registry-k8s-io-pause-3-8-aP7uYsw5XCmoDy5W in virthost.ostest.test.metalkube.org:5000/localimages/local-test-image: manifest unknown: manifest unknown
Or:
Nov 01 08:13:53.817430 host5.cluster9.ocpci.eng.rdu2.redhat.com kubenswrapper[2385]: E1101 08:13:53.817387 2385 kuberuntime_manager.go:862] container &Container{Name:donothing,Image:host1.cluster9.ocpci.eng.rdu2.redhat.com:5000/localimages/local-test-image:e2e-28-registry-k8s-io-pause-3-8-aP7uYsw5XCmoDy5W,Command:[],Args:[],WorkingDir:,Ports:[]ContainerPort{},Env:[]EnvVar{},Resources:ResourceRequirements{Limits:ResourceList{},Requests:ResourceList{},},VolumeMounts:[]VolumeMount{VolumeMount{Name:kube-api-access-bx6jd,ReadOnly:true,MountPath:/var/run/secrets/kubernetes.io/serviceaccount,SubPath:,MountPropagation:nil,SubPathExpr:,},},LivenessProbe:nil,ReadinessProbe:nil,Lifecycle:nil,TerminationMessagePath:/dev/termination-log,ImagePullPolicy:IfNotPresent,SecurityContext:nil,Stdin:false,StdinOnce:false,TTY:false,EnvFrom:[]EnvFromSource{},TerminationMessagePolicy:File,VolumeDevices:[]VolumeDevice{},StartupProbe:nil,} start failed in pod rs-bx6tp_e2e-disruption-1729(e46c8578-3e2a-40b3-b3dd-4b23b696e7e0): ErrImagePull: rpc error: code = Unknown desc = reading manifest e2e-28-registry-k8s-io-pause-3-8-aP7uYsw5XCmoDy5W in host1.cluster9.ocpci.eng.rdu2.redhat.com:5000/localimages/local-test-image: manifest unknown: manifest unknown
This ErrImagePull with manifest unknown looks to be the root of the problem to me.
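To confirm the tag really is missing from the disconnected mirror (rather than this being a transient pull failure), something like the sketch below asks the registry's v2 manifest endpoint directly; the hostname, repository, and tag are copied from the first journal line above, and the TLS/auth handling is an assumption about the CI registry:

# Minimal sketch: query the local mirror registry (Docker Registry HTTP API v2)
# for the manifest the kubelet failed to pull. A 404 with MANIFEST_UNKNOWN would
# match the ErrImagePull above. Hostname/repo/tag copied from the journal output;
# TLS verification is disabled only because CI registries often use self-signed
# certs (an assumption, adjust as needed).
import requests

registry = "virthost.ostest.test.metalkube.org:5000"
repo = "localimages/local-test-image"
tag = "e2e-28-registry-k8s-io-pause-3-8-aP7uYsw5XCmoDy5W"

resp = requests.get(
    f"https://{registry}/v2/{repo}/manifests/{tag}",
    headers={"Accept": "application/vnd.docker.distribution.manifest.v2+json"},
    verify=False,
)
print(resp.status_code, resp.text[:200])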
Affecting at least:
job=periodic-ci-openshift-release-master-nightly-4.12-e2e-metal-ipi-ovn-ipv6
job=periodic-ci-openshift-release-master-nightly-4.12-e2e-metal-ipi-sdn-bm
Looks to have started yesterday around 7pm EST:
https://testgrid.k8s.io/redhat-openshift-ocp-release-4.12-blocking#periodic-ci-openshift-release-master-nightly-4.12-e2e-metal-ipi-ovn-ipv6
But it also happened in a batch a couple of days ago, Oct 27-28.
- relates to: TRT-656 No 4.12 Nightly Payloads for 8 Days (Closed)
- links to: