  OCP Technical Release Team / TRT-1335

Investigate Pathological FailedScheduling Events on GCP & OVN


    • Type: Story
    • Resolution: Done
    • Priority: Major

      Component Readiness shows 8 regressed components on ovn amd64 gcp.

      These regressions mostly trace back to pathological-events tests across several components:

      [sig-arch] events should not repeat pathologically for ns/openshift-authentication
      [sig-arch] events should not repeat pathologically for ns/openshift-dns
      [sig-arch] events should not repeat pathologically for ns/openshift-controller-manager
      [sig-arch] events should not repeat pathologically
      [sig-arch] events should not repeat pathologically for ns/openshift-ovn-kubernetes

      We thought this was TRT-1334 and the linked OCPBUG, but it may be something separate.

      These tests may have begun to degrade around Oct 14, but we don't have good visibility into the pass rates before then.

      For a sample job run we analyzed: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.15-upgrade-from-stable-4.14-e2e-gcp-ovn-rt-upgrade/1717050802796761088
      Looking at the failing job runs, we see batches of these pathological events near the end of the upgrade spyglass chart.

      The pathological events are always of the form:

      event happened 22 times, something is wrong: ns/openshift-controller-manager pod/controller-manager-7bfb568887-dxpv9 hmsg/059c489b5a - pathological/true reason/FailedScheduling 0/6 nodes are available: 2 node(s) didn't match Pod's node affinity/selector, 2 node(s) didn't match pod anti-affinity rules, 2 node(s) were unschedulable. preemption: 0/6 nodes are available: 2 node(s) didn't match pod anti-affinity rules, 4 Preemption is not helpful for scheduling.. From: 07:35:29Z To: 07:35:30Z result=reject 
      

      This indicates a node scheduling problem.

      We have found that these events are quite common even on successful runs; however, they typically do not surpass the pathological limit of 20. The increase could be caused by the kube-scheduler retrying more often, or by longer node updates, but we checked the latter and node update times look the same on successful runs prior to Oct 13.
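
      As a rough illustration of the check involved, here is a minimal sketch in Go of flagging events whose repeat count exceeds the limit of 20. This is not origin's actual implementation; the Event type and field names are illustrative only.

      package main

      import "fmt"

      // pathologicalThreshold mirrors the limit of 20 repeats discussed above.
      const pathologicalThreshold = 20

      // Event is a simplified stand-in for a gathered cluster event.
      type Event struct {
          Namespace string
          Reason    string
          Count     int
      }

      // flagPathological returns the events that repeated more than the threshold.
      func flagPathological(events []Event) []Event {
          var flagged []Event
          for _, e := range events {
              if e.Count > pathologicalThreshold {
                  flagged = append(flagged, e)
              }
          }
          return flagged
      }

      func main() {
          events := []Event{
              {Namespace: "openshift-controller-manager", Reason: "FailedScheduling", Count: 22},
              {Namespace: "openshift-dns", Reason: "Unhealthy", Count: 5},
          }
          for _, e := range flagPathological(events) {
              fmt.Printf("event happened %d times, something is wrong: ns/%s reason/%s\n",
                  e.Count, e.Namespace, e.Reason)
          }
      }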

      Investigate whether the scheduler retry frequency has changed (checking a 4.14 graph might help).

      Check whether these events also happen on other clouds (even if not pathologically).

      Once this is deemed acceptable, we should update origin to ignore these FailedScheduling events when they are overlapped by a NodeUpdate interval; a sketch of that rule follows.
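
      A minimal sketch of that rule, assuming a hypothetical Interval type rather than origin's real monitor types: a FailedScheduling interval would be ignored when at least one NodeUpdate interval overlaps it.

      package main

      import (
          "fmt"
          "time"
      )

      // Interval is a simplified stand-in for a monitor interval.
      type Interval struct {
          Reason string
          From   time.Time
          To     time.Time
      }

      // overlaps reports whether two intervals share any point in time.
      func overlaps(a, b Interval) bool {
          return !a.To.Before(b.From) && !b.To.Before(a.From)
      }

      // shouldIgnore returns true when a FailedScheduling interval is covered
      // by at least one NodeUpdate interval, so it would not count toward the
      // pathological limit.
      func shouldIgnore(ev Interval, nodeUpdates []Interval) bool {
          if ev.Reason != "FailedScheduling" {
              return false
          }
          for _, nu := range nodeUpdates {
              if overlaps(ev, nu) {
                  return true
              }
          }
          return false
      }

      func main() {
          // Times are illustrative, loosely based on the sample event above.
          parse := func(s string) time.Time {
              t, _ := time.Parse(time.RFC3339, s)
              return t
          }
          ev := Interval{Reason: "FailedScheduling", From: parse("2023-10-25T07:35:29Z"), To: parse("2023-10-25T07:35:30Z")}
          nodeUpdates := []Interval{{Reason: "NodeUpdate", From: parse("2023-10-25T07:30:00Z"), To: parse("2023-10-25T07:40:00Z")}}
          fmt.Println(shouldIgnore(ev, nodeUpdates)) // prints: true
      }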

      These intervals appear to happen all over on all clouds, but they don't hit the pathological limit of 20 normally: https://grafana-loki.ci.openshift.org/explore?orgId=1&left=%7B%22datasource%22:%22PCEB727DF2F34084E%22,%22queries%22:%5B%7B%22refId%22:%22A%22,%22datasource%22:%7B%22type%22:%22loki%22,%22uid%22:%22PCEB727DF2F34084E%22%7D,%22editorMode%22:%22code%22,%22expr%22:%22%7Btype%3D%5C%22origin-interval%5C%22,invoker%3D~%5C%22.%2Aaws.%2A%5C%22%7D%20%7C~%20%5C%22FailedScheduling%5C%22%20%7C~%20%5C%22apiserver%5C%22%22,%22queryType%22:%22range%22%7D%5D,%22range%22:%7B%22from%22:%22now-1h%22,%22to%22:%22now%22%7D%7D
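
      For reference, the LogQL expression embedded in that link decodes to roughly the following (this example scopes to AWS invokers; adjusting the invoker regex covers other clouds):

      {type="origin-interval",invoker=~".*aws.*"} |~ "FailedScheduling" |~ "apiserver"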

            Assignee: Devan Goodwin (rhn-engineering-dgoodwin)
            Reporter: Devan Goodwin (rhn-engineering-dgoodwin)