Type: Story
Resolution: Done
Priority: Undefined
Fix Version/s: None
Affects Version/s: None
Labels:
- watcher

Blocked:
False
Blocked Reason:
None
Ready:
False

Component Readiness shows 8 regressed components on ovn amd64 gcp.

Seems to mostly lead back to pathological events tests from several components:

[sig-arch] events should not repeat pathologically for ns/openshift-authentication
[sig-arch] events should not repeat pathologically for ns/openshift-dns
[sig-arch] events should not repeat pathologically for ns/openshift-controller-manager
[sig-arch] events should not repeat pathologically
[sig-arch] events should not repeat pathologically for ns/openshift-ovn-kubernetes

Also looks to be hitting:

[sig-arch][Feature:ClusterUpgrade] Cluster should be upgradeable after finishing upgrade [Late][Suite:upgrade]

Examples:

https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.15-upgrade-from-stable-4.14-e2e-gcp-ovn-rt-upgrade/1717050802796761088

https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.15-upgrade-from-stable-4.14-e2e-gcp-ovn-rt-upgrade/1715484618926329856

https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.15-upgrade-from-stable-4.14-e2e-gcp-ovn-rt-upgrade/1716412385557745664

https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.15-upgrade-from-stable-4.14-e2e-gcp-ovn-rt-upgrade/1716600654568361984

https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.15-upgrade-from-stable-4.14-e2e-gcp-ovn-rt-upgrade/1716829054398631936

Looking into the failing job runs, we see batches of these pathological events near the end of the upgrade spyglass chart.

The pathological events are always of the form:

event happened 22 times, something is wrong: ns/openshift-controller-manager pod/controller-manager-7bfb568887-dxpv9 hmsg/059c489b5a - pathological/true reason/FailedScheduling 0/6 nodes are available: 2 node(s) didn't match Pod's node affinity/selector, 2 node(s) didn't match pod anti-affinity rules, 2 node(s) were unschedulable. preemption: 0/6 nodes are available: 2 node(s) didn't match pod anti-affinity rules, 4 Preemption is not helpful for scheduling.. From: 07:35:29Z To: 07:35:30Z result=reject

This indicates a node scheduling problem.
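To compare events like the one above across job runs, it can help to pull the fields out of the message. The following is a minimal sketch, assuming the message layout shown in this ticket; the function name `parse_event` and both regular expressions are illustrative and are not part of origin or Sippy.

```python
import re

# Assumed layout, taken from the example event in this ticket:
# "event happened N times, something is wrong: ns/<ns> pod/<pod> ... reason/<reason> <message>"
EVENT_RE = re.compile(
    r"event happened (?P<count>\d+) times, something is wrong: "
    r"ns/(?P<namespace>\S+) pod/(?P<pod>\S+) .*?"
    r"reason/(?P<reason>\S+) (?P<message>.*)"
)
# Matches scheduler filter results such as "2 node(s) were unschedulable".
NODE_RE = re.compile(r"(\d+) node\(s\) ([^,.]+)")


def parse_event(line):
    """Return the parsed fields of a pathological event line, or None if it does not match."""
    match = EVENT_RE.search(line)
    if match is None:
        return None
    info = match.groupdict()
    info["count"] = int(info["count"])
    # Only look at the filter results, not the trailing "preemption:" summary.
    filter_part = info["message"].split(" preemption:", 1)[0]
    info["node_reasons"] = {
        reason.strip(): int(n) for n, reason in NODE_RE.findall(filter_part)
    }
    return info


example = (
    "event happened 22 times, something is wrong: ns/openshift-controller-manager "
    "pod/controller-manager-7bfb568887-dxpv9 hmsg/059c489b5a - pathological/true "
    "reason/FailedScheduling 0/6 nodes are available: 2 node(s) didn't match Pod's "
    "node affinity/selector, 2 node(s) didn't match pod anti-affinity rules, "
    "2 node(s) were unschedulable. preemption: 0/6 nodes are available: 2 node(s) "
    "didn't match pod anti-affinity rules, 4 Preemption is not helpful for scheduling.. "
    "From: 07:35:29Z To: 07:35:30Z result=reject"
)

parsed = parse_event(example)
print(parsed["namespace"], parsed["reason"], parsed["count"])
for reason, nodes in parsed["node_reasons"].items():
    print(f"{nodes} node(s): {reason}")
```

In the example, the per-reason node counts add up to all 6 nodes, with 2 of them reported as unschedulable.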

Overlapping these events, clusteroperator/machine-config is Available=false with a pretty severe-looking error:

condition/Degraded reason/MachineConfigControllerFailed status/True Unable to apply 4.15.0-0.nightly-2023-10-25-052621: ControllerConfig.machineconfiguration.openshift.io "machine-config-controller" is invalid: [status.controllerCertificates[0].notAfter: Required value, status.controllerCertificates[0].notBefore: Required value, status.controllerCertificates[1].notAfter: Required value, status.controllerCertificates[1].notBefore: Required value, status.controllerCertificates[2].notAfter: Required value, status.controllerCertificates[2].notBefore: Required value, status.controllerCertificates[3].notAfter: Required value, status.controllerCertificates[3].notBefore: Required value, status.controllerCertificates[4].notAfter: Required value, status.controllerCertificates[4].notBefore: Required value, status.controllerCertificates[5].notAfter: Required value, status.controllerCertificates[5].notBefore: Required value, status.controllerCertificates[6].notAfter: Required value, status.controllerCertificates[6].notBefore: Required value, status.controllerCertificates[7].notAfter: Required value, status.controllerCertificates[7].notBefore: Required value, status.controllerCertificates[8].notAfter: Required value, status.controllerCertificates[8].notBefore: Required value, status.controllerCertificates[9].notAfter: Required value, status.controllerCertificates[9].notBefore: Required value, <nil>: Invalid value: "null": some validation rules were not checked because the object was invalid; correct the existing errors to complete validation]
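The error is long but repetitive: every listed controllerCertificates entry is missing the same two required fields. A small sketch that groups the "Required value" complaints by certificate index makes that pattern easier to see. The helper name `missing_cert_fields` is hypothetical, and `excerpt` is only the first part of the message above, shortened for readability.

```python
import re
from collections import defaultdict

# Matches entries such as "status.controllerCertificates[3].notAfter: Required value".
FIELD_RE = re.compile(
    r"status\.controllerCertificates\[(\d+)\]\.(\w+): Required value"
)


def missing_cert_fields(message):
    """Map each certificate index to the list of required fields reported missing."""
    missing = defaultdict(list)
    for index, field in FIELD_RE.findall(message):
        missing[int(index)].append(field)
    return dict(missing)


# Shortened excerpt of the Degraded condition message quoted in this ticket.
excerpt = (
    'ControllerConfig.machineconfiguration.openshift.io "machine-config-controller" '
    "is invalid: [status.controllerCertificates[0].notAfter: Required value, "
    "status.controllerCertificates[0].notBefore: Required value, "
    "status.controllerCertificates[1].notAfter: Required value, "
    "status.controllerCertificates[1].notBefore: Required value"
)

for index, fields in sorted(missing_cert_fields(excerpt).items()):
    print(f"controllerCertificates[{index}] is missing: {', '.join(fields)}")
```

Run against the full message, this would list indexes 0 through 9, each missing notAfter and notBefore.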

A test that tracks this specific problem is also flaking:
[bz-Machine Config Operator] clusteroperator/machine-config should not change condition/Available

Sippy indicates a dramatic uptick in flakes for this test somewhere in the last week, from 0.8% to 21% as of today.

Clicking through to the test, it appears the problem started on Oct 20, possibly late on Oct 19. The failure outputs at the bottom of the test page all show this error.

The problem slipped through aggregation because it surfaces on only 20-30% of runs.
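A rough way to see why a 20-30% failure rate can hide in a small batch of runs is a simple binomial calculation. This is only a sketch under stated assumptions: the batch size of 10 runs and the cutoffs of 2 or 3 failures are hypothetical values chosen for illustration, and the ticket does not describe the statistics the aggregator actually uses.

```python
from math import comb


def prob_at_most(k, n, p):
    """Probability of at most k failures in n independent runs with failure rate p."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


RUNS = 10  # assumed batch size, not taken from the ticket

for rate in (0.20, 0.25, 0.30):
    for cutoff in (2, 3):  # hypothetical failure counts a batch might tolerate
        chance = prob_at_most(cutoff, RUNS, rate)
        print(f"failure rate {rate:.0%}: P(at most {cutoff} of {RUNS} fail) = {chance:.2f}")
```

Under these assumptions, a batch often contains only a handful of failures, so any check that tolerates a few failures per batch would pass much of the time.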

is cloned by

TRT-1335 Investigate Pathological FailedScheduling Events on GCP & OVN

Closed

is related to

OCPBUGS-22364 ControllerCertificate struct validation failed during upgrade from 4.14 to 4.15

Closed
