OpenShift Bugs · OCPBUGS-53083

CI flake: EnsureNoCrashingPods fails on pods that lose their leader lease


    • Type: Bug
    • Resolution: Unresolved
    • Priority: Major
    • Affects Version: 4.19
    • Component: HyperShift
    • Rejected

      Prow job management clusters run SingleReplica for density and cost reasons. The downside is that the management KAS is subject to disruption for any number of reasons (root CI node scaling, eviction, preemption, etc.).

      The most common components to lose their leases are CAPI and the CPO.

      E0313 06:08:36.583920       1 leaderelection.go:340] Failed to update lock optimitically: Put "https://172.29.0.1:443/apis/coordination.k8s.io/v1/namespaces/e2e-clusters-l7nl5-proxy-zvks7/leases/controller-leader-elect-capa": context deadline exceeded, falling back to slow path
      E0313 06:08:36.584009       1 leaderelection.go:347] error retrieving resource lock e2e-clusters-l7nl5-proxy-zvks7/controller-leader-elect-capa: client rate limiter Wait returned an error: context deadline exceeded
      I0313 06:08:36.584018       1 leaderelection.go:285] failed to renew lease e2e-clusters-l7nl5-proxy-zvks7/controller-leader-elect-capa: timed out waiting for the condition
      E0313 06:08:36.584065       1 logger.go:99] "problem running manager" err="leader election lost" logger="setup"
      

      https://prow.ci.openshift.org/view/gs/test-platform-results/logs/periodic-ci-openshift-hypershift-release-4.19-periodics-e2e-aws-ovn/1900061096438403072

      https://prow.ci.openshift.org/view/gs/test-platform-results/logs/periodic-ci-openshift-hypershift-release-4.19-periodics-e2e-aws-ovn/1900099595011100672

      https://prow.ci.openshift.org/view/gs/test-platform-results/logs/periodic-ci-openshift-hypershift-release-4.17-periodics-e2e-aws-ovn/1899971811630649344

      I think the best solution is to modify the EnsureNoCrashingPods check to look for "leader election lost" in the last lines of the pod log and not treat that as a failure.
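A minimal sketch of what that check might look like, assuming a helper that inspects only the tail of a container's previous log (the function name `isLeaderElectionLoss` and the tail-window size are hypothetical, not HyperShift's actual API):

```go
package main

import (
	"fmt"
	"strings"
)

// isLeaderElectionLoss reports whether the tail of a pod log indicates the
// container exited because it lost its leader election lease rather than
// crashing for a real reason. Only the last tailLines lines are inspected,
// so an old, unrelated mention of leader election cannot mask a genuine
// crash that happened later. (Hypothetical helper; not HyperShift's API.)
func isLeaderElectionLoss(log string, tailLines int) bool {
	lines := strings.Split(strings.TrimRight(log, "\n"), "\n")
	if len(lines) > tailLines {
		lines = lines[len(lines)-tailLines:]
	}
	for _, line := range lines {
		if strings.Contains(line, "leader election lost") {
			return true
		}
	}
	return false
}

func main() {
	// Tail resembling the CAPA log excerpt above: restart caused by a lost lease.
	log := "I0313 06:08:36 leaderelection.go:285] failed to renew lease\n" +
		`E0313 06:08:36 logger.go:99] "problem running manager" err="leader election lost" logger="setup"` + "\n"
	fmt.Println(isLeaderElectionLoss(log, 10))
	// A real crash should still count as a failure.
	fmt.Println(isLeaderElectionLoss("panic: runtime error\n", 10))
}
```

EnsureNoCrashingPods would call something like this on the previous-container log of any restarted pod and skip the failure when it returns true; everything else would still flake the job as before.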

              sjenning Seth Jennings
              Jie Zhao Jie Zhao