- Bug
- Resolution: Unresolved
- Major
- None
- 4.19
- None
- Rejected
- False
Prow job management clusters run SingleReplica for density and cost reasons. The downside is that the management KAS is subject to disruption for any number of reasons (CI root node scaling, eviction, preemption, etc.).
The components that most commonly lose their lease are CAPI and the CPO.
E0313 06:08:36.583920 1 leaderelection.go:340] Failed to update lock optimitically: Put "https://172.29.0.1:443/apis/coordination.k8s.io/v1/namespaces/e2e-clusters-l7nl5-proxy-zvks7/leases/controller-leader-elect-capa": context deadline exceeded, falling back to slow path
E0313 06:08:36.584009 1 leaderelection.go:347] error retrieving resource lock e2e-clusters-l7nl5-proxy-zvks7/controller-leader-elect-capa: client rate limiter Wait returned an error: context deadline exceeded
I0313 06:08:36.584018 1 leaderelection.go:285] failed to renew lease e2e-clusters-l7nl5-proxy-zvks7/controller-leader-elect-capa: timed out waiting for the condition
E0313 06:08:36.584065 1 logger.go:99] "problem running manager" err="leader election lost" logger="setup"
I think the best solution is to modify the EnsureNoCrashingPods check to look for "leader election lost" in the last lines of the pod log and not consider that a failure.
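A minimal sketch of the proposed tail check, assuming a hypothetical helper function (the actual EnsureNoCrashingPods implementation is not shown here; the function name, signature, and tail-window size are illustrative):

```go
package main

import (
	"fmt"
	"strings"
)

// isLeaderElectionLoss reports whether the tail of a pod log indicates the
// process exited because it lost its leader-election lease, rather than
// crashing for a real reason. Only the last tailLines lines are inspected,
// so an old, already-recovered election loss earlier in the log does not
// mask a genuine crash at the end.
func isLeaderElectionLoss(podLog string, tailLines int) bool {
	lines := strings.Split(strings.TrimSpace(podLog), "\n")
	if len(lines) > tailLines {
		lines = lines[len(lines)-tailLines:]
	}
	for _, line := range lines {
		if strings.Contains(line, "leader election lost") {
			return true
		}
	}
	return false
}

func main() {
	log := `I0313 06:08:36.584018 1 leaderelection.go:285] failed to renew lease
E0313 06:08:36.584065 1 logger.go:99] "problem running manager" err="leader election lost" logger="setup"`
	fmt.Println(isLeaderElectionLoss(log, 10))
}
```

The check would then skip (or tolerate) restarts whose previous-container log tail matches, while still failing on any other crash loop.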