-
Story
-
Resolution: Done
Component Readiness shows 8 regressed components on ovn amd64 gcp.
Seems to mostly lead back to pathological events tests from several components:
[sig-arch] events should not repeat pathologically for ns/openshift-authentication
[sig-arch] events should not repeat pathologically for ns/openshift-dns
[sig-arch] events should not repeat pathologically for ns/openshift-controller-manager
[sig-arch] events should not repeat pathologically
[sig-arch] events should not repeat pathologically for ns/openshift-ovn-kubernetes
Also looks to be hitting:
[sig-arch][Feature:ClusterUpgrade] Cluster should be upgradeable after finishing upgrade [Late][Suite:upgrade]
Examples:
Looking into the failing job runs, we see batches of these pathological events near the end of the upgrade in the spyglass chart.
The pathological events are always of the form:
event happened 22 times, something is wrong: ns/openshift-controller-manager pod/controller-manager-7bfb568887-dxpv9 hmsg/059c489b5a - pathological/true reason/FailedScheduling 0/6 nodes are available: 2 node(s) didn't match Pod's node affinity/selector, 2 node(s) didn't match pod anti-affinity rules, 2 node(s) were unschedulable. preemption: 0/6 nodes are available: 2 node(s) didn't match pod anti-affinity rules, 4 Preemption is not helpful for scheduling.. From: 07:35:29Z To: 07:35:30Z result=reject
This indicates a node scheduling problem.
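For triage across many job runs, it can help to pull the count, namespace, pod and reason out of these lines. Below is a minimal sketch that assumes every line follows the shape of the example above. `EVENT_RE` and `parse_pathological_event` are hypothetical names for illustration, not part of the origin test suite.

```python
import re

# Hypothetical pattern, written against the single example line in this report.
EVENT_RE = re.compile(
    r"event happened (?P<count>\d+) times, something is wrong: "
    r"ns/(?P<namespace>\S+) pod/(?P<pod>\S+) .*?"
    r"reason/(?P<reason>\S+) (?P<message>.*?)"
    r"(?: From: \S+ To: \S+)?(?: result=(?P<result>\S+))?$"
)


def parse_pathological_event(line):
    """Return the fields of a pathological event line, or None if it does not match."""
    m = EVENT_RE.search(line.strip())
    if m is None:
        return None
    fields = m.groupdict()
    fields["count"] = int(fields["count"])
    return fields


SAMPLE = (
    "event happened 22 times, something is wrong: "
    "ns/openshift-controller-manager pod/controller-manager-7bfb568887-dxpv9 "
    "hmsg/059c489b5a - pathological/true reason/FailedScheduling "
    "0/6 nodes are available: 2 node(s) didn't match Pod's node affinity/selector, "
    "2 node(s) didn't match pod anti-affinity rules, 2 node(s) were unschedulable. "
    "preemption: 0/6 nodes are available: 2 node(s) didn't match pod anti-affinity rules, "
    "4 Preemption is not helpful for scheduling.. "
    "From: 07:35:29Z To: 07:35:30Z result=reject"
)

if __name__ == "__main__":
    event = parse_pathological_event(SAMPLE)
    print(event["namespace"], event["reason"], event["count"])
```

Grouping the parsed results by `namespace` and `reason` would show whether the failures in the different components all share the same FailedScheduling cause.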
Overlapping these events, clusteroperator/machine-config is Available=false with a pretty severe-looking error:
condition/Degraded reason/MachineConfigControllerFailed status/True Unable to apply 4.15.0-0.nightly-2023-10-25-052621: ControllerConfig.machineconfiguration.openshift.io "machine-config-controller" is invalid: [status.controllerCertificates[0].notAfter: Required value, status.controllerCertificates[0].notBefore: Required value, status.controllerCertificates[1].notAfter: Required value, status.controllerCertificates[1].notBefore: Required value, status.controllerCertificates[2].notAfter: Required value, status.controllerCertificates[2].notBefore: Required value, status.controllerCertificates[3].notAfter: Required value, status.controllerCertificates[3].notBefore: Required value, status.controllerCertificates[4].notAfter: Required value, status.controllerCertificates[4].notBefore: Required value, status.controllerCertificates[5].notAfter: Required value, status.controllerCertificates[5].notBefore: Required value, status.controllerCertificates[6].notAfter: Required value, status.controllerCertificates[6].notBefore: Required value, status.controllerCertificates[7].notAfter: Required value, status.controllerCertificates[7].notBefore: Required value, status.controllerCertificates[8].notAfter: Required value, status.controllerCertificates[8].notBefore: Required value, status.controllerCertificates[9].notAfter: Required value, status.controllerCertificates[9].notBefore: Required value, <nil>: Invalid value: "null": some validation rules were not checked because the object was invalid; correct the existing errors to complete validation]
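The error is long but repetitive: every listed certificate entry is missing the same two required fields. A rough sketch for condensing such a message into a per-index summary follows. It assumes the `status.controllerCertificates[N].<field>: Required value` wording shown above, and `missing_cert_fields` is a hypothetical helper, not anything provided by the machine-config operator.

```python
import re
from collections import defaultdict

# Matches one "Required value" entry for a controllerCertificates element.
FIELD_RE = re.compile(
    r"status\.controllerCertificates\[(\d+)\]\.(\w+): Required value"
)


def missing_cert_fields(error_text):
    """Map each certificate index to the set of required fields reported missing."""
    missing = defaultdict(set)
    for index, field in FIELD_RE.findall(error_text):
        missing[int(index)].add(field)
    return dict(missing)


# Shortened excerpt of the error above, limited to the first two entries.
EXCERPT = (
    "[status.controllerCertificates[0].notAfter: Required value, "
    "status.controllerCertificates[0].notBefore: Required value, "
    "status.controllerCertificates[1].notAfter: Required value, "
    "status.controllerCertificates[1].notBefore: Required value]"
)

if __name__ == "__main__":
    for index, fields in sorted(missing_cert_fields(EXCERPT).items()):
        print(index, sorted(fields))
```

Run against the full message, this would report `notAfter` and `notBefore` missing for each of the ten entries, indices 0 through 9.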
We can see a test flaking indicating this specific problem for:
[bz-Machine Config Operator] clusteroperator/machine-config should not change condition/Available
Sippy indicates a dramatic uptick in flakes for this test sometime in the last week, from 0.8% to 21% as of today.
Clicking through to the test, it appears the problem started Oct 20, possibly late on Oct 19. The failure outputs at the bottom of the test page all show this error.
The problem slipped through aggregation because it only surfaces on 20-30% of runs.
- is cloned by: TRT-1335 Investigate Pathological FailedScheduling Events on GCP & OVN (Closed)
- is related to: OCPBUGS-22364 ControllerCertificate struct validation failed during upgrade from 4.14 to 4.15 (Closed)