OCP Technical Release Team / TRT-1335

Investigate Pathological FailedScheduling Events on GCP & OVN


    • Type: Story
    • Resolution: Done
    • Priority: Major

      Component Readiness shows 8 regressed components on ovn amd64 gcp.

      The regressions mostly seem to lead back to pathological events tests from several components:

      [sig-arch] events should not repeat pathologically for ns/openshift-authentication
      [sig-arch] events should not repeat pathologically for ns/openshift-dns
      [sig-arch] events should not repeat pathologically for ns/openshift-controller-manager
      [sig-arch] events should not repeat pathologically
      [sig-arch] events should not repeat pathologically for ns/openshift-ovn-kubernetes

      We thought this was TRT-1334 and the linked OCPBUG, but it may be something separate.

      The test may have begun to degrade around Oct 14, but we don't have great visibility into the pass rates before that.

      For a sample job run we analyzed: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.15-upgrade-from-stable-4.14-e2e-gcp-ovn-rt-upgrade/1717050802796761088
      Looking into the failing job runs, we see batches of these pathological events near the end of the upgrade spyglass chart.

      The pathological events are always of the form:

      event happened 22 times, something is wrong: ns/openshift-controller-manager pod/controller-manager-7bfb568887-dxpv9 hmsg/059c489b5a - pathological/true reason/FailedScheduling 0/6 nodes are available: 2 node(s) didn't match Pod's node affinity/selector, 2 node(s) didn't match pod anti-affinity rules, 2 node(s) were unschedulable. preemption: 0/6 nodes are available: 2 node(s) didn't match pod anti-affinity rules, 4 Preemption is not helpful for scheduling.. From: 07:35:29Z To: 07:35:30Z result=reject 
      

      This indicates a node scheduling problem.

      We have found that these events are quite common on successful runs; however, they typically don't surpass the pathological limit of 20. This could be caused by the kube scheduler retrying more often, or by longer node updates, but we checked the latter and the node update time looks the same on successful runs prior to Oct 13.
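
      For context on the threshold: the interval message itself carries the repeat count (e.g. "happened 22 times"), and anything over 20 is flagged. A minimal sketch of that counting logic, for illustration only (this is not origin's actual implementation, just the shape of the check):

      package main

      import (
          "fmt"
          "regexp"
          "strconv"
      )

      // happenedRE pulls the repeat count out of messages like the one above:
      // "event happened 22 times, something is wrong: ns/... reason/FailedScheduling ..."
      var happenedRE = regexp.MustCompile(`happened (\d+) times`)

      // pathologicalLimit is the repeat threshold referenced above; repeats at or
      // below it are tolerated, anything beyond is flagged as pathological.
      const pathologicalLimit = 20

      // exceedsLimit reports whether an event message records more repeats than
      // the pathological limit.
      func exceedsLimit(msg string) bool {
          m := happenedRE.FindStringSubmatch(msg)
          if m == nil {
              return false
          }
          n, err := strconv.Atoi(m[1])
          if err != nil {
              return false
          }
          return n > pathologicalLimit
      }

      func main() {
          msg := "event happened 22 times, something is wrong: ns/openshift-controller-manager pod/controller-manager-7bfb568887-dxpv9 reason/FailedScheduling ..."
          fmt.Println(exceedsLimit(msg)) // true: 22 > 20
      }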

      Investigate whether the scheduler retry frequency has changed (checking a 4.14 graph might help).

      Check whether these events also happen on other clouds (they appear to, but not pathologically).

      Once deemed ok, we should update origin to ignore these FailedScheduling events if they are overlapped by a NodeUpdate interval.
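
      A rough sketch of what that origin change could look like. The Interval type and helper names here are simplified stand-ins, not origin's real monitor API; the point is just the overlap test between a FailedScheduling event interval and a NodeUpdate interval:

      package main

      import (
          "fmt"
          "time"
      )

      // Interval is a simplified stand-in for an origin monitor interval; only the
      // fields needed for the overlap check are modeled here.
      type Interval struct {
          Reason string
          From   time.Time
          To     time.Time
      }

      // overlaps reports whether two intervals share any point in time.
      func overlaps(a, b Interval) bool {
          return a.From.Before(b.To) && b.From.Before(a.To)
      }

      // excuseFailedScheduling reports whether a FailedScheduling event interval
      // should be ignored because a NodeUpdate interval covers it. Hypothetical
      // helper, not origin's actual API.
      func excuseFailedScheduling(event Interval, nodeUpdates []Interval) bool {
          if event.Reason != "FailedScheduling" {
              return false
          }
          for _, nu := range nodeUpdates {
              if overlaps(event, nu) {
                  return true
              }
          }
          return false
      }

      func main() {
          // Illustrative timestamps shaped like the From/To seen in the event above.
          parse := func(s string) time.Time {
              t, _ := time.Parse("15:04:05Z07:00", s)
              return t
          }
          event := Interval{Reason: "FailedScheduling", From: parse("07:35:29Z"), To: parse("07:35:30Z")}
          nodeUpdate := Interval{Reason: "NodeUpdate", From: parse("07:30:00Z"), To: parse("07:40:00Z")}
          fmt.Println(excuseFailedScheduling(event, []Interval{nodeUpdate})) // true
      }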

      These intervals appear to happen all over on all clouds, but they normally don't hit the pathological limit of 20 (Loki query: {type="origin-interval", invoker=~".*aws.*"} |~ "FailedScheduling" |~ "apiserver"): https://grafana-loki.ci.openshift.org/explore?orgId=1&left=%7B%22datasource%22:%22PCEB727DF2F34084E%22,%22queries%22:%5B%7B%22refId%22:%22A%22,%22datasource%22:%7B%22type%22:%22loki%22,%22uid%22:%22PCEB727DF2F34084E%22%7D,%22editorMode%22:%22code%22,%22expr%22:%22%7Btype%3D%5C%22origin-interval%5C%22,invoker%3D~%5C%22.%2Aaws.%2A%5C%22%7D%20%7C~%20%5C%22FailedScheduling%5C%22%20%7C~%20%5C%22apiserver%5C%22%22,%22queryType%22:%22range%22%7D%5D,%22range%22:%7B%22from%22:%22now-1h%22,%22to%22:%22now%22%7D%7D

              rhn-engineering-dgoodwin Devan Goodwin