OpenShift Bugs · OCPBUGS-53083

CI flake: EnsureNoCrashingPods fails on pods that lose their leader lease


    • Type: Bug
    • Resolution: Unresolved
    • Priority: Major
    • Affects Version: 4.19
    • Component: HyperShift
    • Rejected

      Prow job management clusters run SingleReplica for density and cost reasons. The downside is that the management KAS is subject to disruption for any number of reasons (root CI node scaling, eviction, preemption, etc.).

      The most common components to lose their leases are CAPI and the CPO.

      E0313 06:08:36.583920       1 leaderelection.go:340] Failed to update lock optimitically: Put "https://172.29.0.1:443/apis/coordination.k8s.io/v1/namespaces/e2e-clusters-l7nl5-proxy-zvks7/leases/controller-leader-elect-capa": context deadline exceeded, falling back to slow path
      E0313 06:08:36.584009       1 leaderelection.go:347] error retrieving resource lock e2e-clusters-l7nl5-proxy-zvks7/controller-leader-elect-capa: client rate limiter Wait returned an error: context deadline exceeded
      I0313 06:08:36.584018       1 leaderelection.go:285] failed to renew lease e2e-clusters-l7nl5-proxy-zvks7/controller-leader-elect-capa: timed out waiting for the condition
      E0313 06:08:36.584065       1 logger.go:99] "problem running manager" err="leader election lost" logger="setup"
      

      https://prow.ci.openshift.org/view/gs/test-platform-results/logs/periodic-ci-openshift-hypershift-release-4.19-periodics-e2e-aws-ovn/1900061096438403072

      https://prow.ci.openshift.org/view/gs/test-platform-results/logs/periodic-ci-openshift-hypershift-release-4.19-periodics-e2e-aws-ovn/1900099595011100672

      https://prow.ci.openshift.org/view/gs/test-platform-results/logs/periodic-ci-openshift-hypershift-release-4.17-periodics-e2e-aws-ovn/1899971811630649344

      I think the best solution is to modify the EnsureNoCrashingPods check to look for "leader election lost" in the last lines of the pod log and not treat that as a failure.
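A minimal sketch of what that check might look like, assuming a helper that inspects only the tail of a container's previous log (the function name `isLeaderElectionLoss` and the tail-window size are hypothetical, not HyperShift's actual API):

```go
package main

import (
	"fmt"
	"strings"
)

// isLeaderElectionLoss reports whether the tail of a pod log indicates the
// container exited because it lost its leader election lease rather than
// crashing for a real reason. Only the last tailLines lines are inspected,
// so an old, unrelated mention of leader election cannot mask a genuine
// crash that happened later. (Hypothetical helper; not HyperShift's API.)
func isLeaderElectionLoss(log string, tailLines int) bool {
	lines := strings.Split(strings.TrimRight(log, "\n"), "\n")
	if len(lines) > tailLines {
		lines = lines[len(lines)-tailLines:]
	}
	for _, line := range lines {
		if strings.Contains(line, "leader election lost") {
			return true
		}
	}
	return false
}

func main() {
	// Tail resembling the CAPA log excerpt above: restart caused by a lost lease.
	log := "I0313 06:08:36 leaderelection.go:285] failed to renew lease\n" +
		`E0313 06:08:36 logger.go:99] "problem running manager" err="leader election lost" logger="setup"` + "\n"
	fmt.Println(isLeaderElectionLoss(log, 10))
	// A real crash should still count as a failure.
	fmt.Println(isLeaderElectionLoss("panic: runtime error\n", 10))
}
```

EnsureNoCrashingPods would call something like this on the previous-container log of any restarted pod and skip the failure when it returns true; everything else would still flake the job as before.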

              sjenning Seth Jennings
              Jie Zhao Jie Zhao