- Bug
- Resolution: Done
- Critical
- None
- 4.12
- None
- False
Noticed the mass failures in: https://amd64.ocp.releases.ci.openshift.org/releasestream/4.12.0-0.nightly/release/4.12.0-0.nightly-2022-10-31-232349
So far it's just one payload, so this may resolve on its own, but for now treating it as a real problem as it looks very broken.
From the above link, these two jobs are good candidates for investigation:
Using the spyglass charts in both of the above links, you'll see lots of pods stuck in Pending, as well as alerts related to PodDisruptionLimit and PodStartupStorageOperationsFailing.
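As a cross-check outside spyglass, a minimal sketch along these lines (assuming kubeconfig access to the affected cluster; not part of the CI tooling) lists the Pending pods and the node each one is scheduled to:

# Minimal sketch: list pods stuck in Pending and the node each is scheduled to,
# using the official kubernetes Python client. Assumes a kubeconfig pointing at
# the affected cluster.
from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()

pending = v1.list_pod_for_all_namespaces(field_selector="status.phase=Pending")
for pod in pending.items:
    print(f"{pod.metadata.namespace}/{pod.metadata.name} -> node={pod.spec.node_name}")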
Tracing one of the pods stuck in Pending down to the node it's on (visible in spyglass), and then to that host's systemd journal, we see something like:
Nov 01 09:22:44.912411 worker-0.ostest.test.metalkube.org kubenswrapper[3387]: E1101 09:22:44.912401 3387 kuberuntime_manager.go:862] container &Container{Name:donothing,Image:virthost.ostest.test.metalkube.org:5000/localimages/local-test-image:e2e-28-registry-k8s-io-pause-3-8-aP7uYsw5XCmoDy5W,Command:[],Args:[],WorkingDir:,Ports:[]ContainerPort{},Env:[]EnvVar{},Resources:ResourceRequirements{Limits:ResourceList{},Requests:ResourceList{},},VolumeMounts:[]VolumeMount{VolumeMount{Name:kube-api-access-6fr4z,ReadOnly:true,MountPath:/var/run/secrets/kubernetes.io/serviceaccount,SubPath:,MountPropagation:nil,SubPathExpr:,},},LivenessProbe:nil,ReadinessProbe:nil,Lifecycle:nil,TerminationMessagePath:/dev/termination-log,ImagePullPolicy:IfNotPresent,SecurityContext:nil,Stdin:false,StdinOnce:false,TTY:false,EnvFrom:[]EnvFromSource{},TerminationMessagePolicy:File,VolumeDevices:[]VolumeDevice{},StartupProbe:nil,} start failed in pod pod-0_e2e-disruption-7215(a28ffad0-d526-4add-9a6b-7b5a2780b6a0): ErrImagePull: rpc error: code = Unknown desc = reading manifest e2e-28-registry-k8s-io-pause-3-8-aP7uYsw5XCmoDy5W in virthost.ostest.test.metalkube.org:5000/localimages/local-test-image: manifest unknown: manifest unknown
Or:
Nov 01 08:13:53.817430 host5.cluster9.ocpci.eng.rdu2.redhat.com kubenswrapper[2385]: E1101 08:13:53.817387 2385 kuberuntime_manager.go:862] container &Container{Name:donothing,Image:host1.cluster9.ocpci.eng.rdu2.redhat.com:5000/localimages/local-test-image:e2e-28-registry-k8s-io-pause-3-8-aP7uYsw5XCmoDy5W,Command:[],Args:[],WorkingDir:,Ports:[]ContainerPort{},Env:[]EnvVar{},Resources:ResourceRequirements{Limits:ResourceList{},Requests:ResourceList{},},VolumeMounts:[]VolumeMount{VolumeMount{Name:kube-api-access-bx6jd,ReadOnly:true,MountPath:/var/run/secrets/kubernetes.io/serviceaccount,SubPath:,MountPropagation:nil,SubPathExpr:,},},LivenessProbe:nil,ReadinessProbe:nil,Lifecycle:nil,TerminationMessagePath:/dev/termination-log,ImagePullPolicy:IfNotPresent,SecurityContext:nil,Stdin:false,StdinOnce:false,TTY:false,EnvFrom:[]EnvFromSource{},TerminationMessagePolicy:File,VolumeDevices:[]VolumeDevice{},StartupProbe:nil,} start failed in pod rs-bx6tp_e2e-disruption-1729(e46c8578-3e2a-40b3-b3dd-4b23b696e7e0): ErrImagePull: rpc error: code = Unknown desc = reading manifest e2e-28-registry-k8s-io-pause-3-8-aP7uYsw5XCmoDy5W in host1.cluster9.ocpci.eng.rdu2.redhat.com:5000/localimages/local-test-image: manifest unknown: manifest unknown
This ErrImagePull with manifest unknown looks to be the root of the problem to me.
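To confirm the tag really is missing from the disconnected mirror (rather than this being a transient pull failure), something like the sketch below asks the registry's v2 manifest endpoint directly; the hostname, repository, and tag are copied from the first journal line above, and the TLS/auth handling is an assumption about the CI registry:

# Minimal sketch: query the local mirror registry (Docker Registry HTTP API v2)
# for the manifest the kubelet failed to pull. A 404 with MANIFEST_UNKNOWN would
# match the ErrImagePull above. Hostname/repo/tag copied from the journal output;
# TLS verification is disabled only because CI registries often use self-signed
# certs (an assumption, adjust as needed).
import requests

registry = "virthost.ostest.test.metalkube.org:5000"
repo = "localimages/local-test-image"
tag = "e2e-28-registry-k8s-io-pause-3-8-aP7uYsw5XCmoDy5W"

resp = requests.get(
    f"https://{registry}/v2/{repo}/manifests/{tag}",
    headers={"Accept": "application/vnd.docker.distribution.manifest.v2+json"},
    verify=False,
)
print(resp.status_code, resp.text[:200])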
Affecting at least:
job=periodic-ci-openshift-release-master-nightly-4.12-e2e-metal-ipi-ovn-ipv6
job=periodic-ci-openshift-release-master-nightly-4.12-e2e-metal-ipi-sdn-bm
Looks to have started yesterday around 7pm EST:
https://testgrid.k8s.io/redhat-openshift-ocp-release-4.12-blocking#periodic-ci-openshift-release-master-nightly-4.12-e2e-metal-ipi-ovn-ipv6
But it also happened in a batch a couple of days ago, Oct 27-28.
- relates to: TRT-656 No 4.12 Nightly Payloads for 8 Days (Closed)
- links to: