Description of problem:
All pods that use the leader election mechanism are constantly crashing, all of them with the error "leader election lost".
For example, the following pods that use leader election show multiple restarts:
$ omc get pods -n openshift-gitops-operator
NAME                                                            READY   STATUS    RESTARTS   AGE
openshift-gitops-operator-controller-manager-6747fcbcbd-4gd7x   2/2     Running   124        37d
----
$ omc get pods -n openshift-servicemesh-operators
NAME                              READY   STATUS    RESTARTS   AGE
...
istio-operator-85fbbdbf6c-ctgrm   1/1     Running   126        37d
----
$ omc get pods -n openshift-tempo-operator
NAME                                       READY   STATUS    RESTARTS   AGE
tempo-operator-controller-87c5b548-cmmcn   2/2     Running   119        37d
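To confirm the scope, a quick sketch for listing every pod with an unusually high restart count across the must-gather (assuming omc accepts the same -A flag as oc; the threshold of 50 and the column positions are assumptions based on the output above):

# Column 5 is RESTARTS once -A prepends the NAMESPACE column
$ omc get pods -A | awk 'NR > 1 && $5+0 > 50 {print $1, $2, $5}'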
There could be other affected pods, but essentially every pod that uses leader election is crashing with the same "leader election lost" error, for example:
$ omc logs -n openshift-gitops-operator openshift-gitops-operator-controller-manager-6747fcbcbd-4gd7x -c manager --previous
2025-06-03T02:52:34.238152004Z E0603 02:52:34.238095 1 leaderelection.go:369] Failed to update lock: Put "https://172.23.0.1:443/apis/coordination.k8s.io/v1/namespaces/openshift-gitops-operator/leases/2b63967d.openshift.io": context deadline exceeded
2025-06-03T02:52:34.238152004Z I0603 02:52:34.238131 1 leaderelection.go:285] failed to renew lease openshift-gitops-operator/2b63967d.openshift.io: timed out waiting for the condition
2025-06-03T02:52:34.238217143Z 2025-06-03T02:52:34Z ERROR setup problem running manager {"error": "leader election lost"}
2025-06-03T02:52:34.238217143Z main.main
---------
$ omc logs -n openshift-servicemesh-operators istio-operator-85fbbdbf6c-ctgrm -c istio-operator --previous
...
2025-06-03T02:52:32.845603962Z E0603 02:52:32.845505 1 leaderelection.go:356] Failed to update lock: Put "https://172.23.0.1:443/api/v1/namespaces/openshift-servicemesh-operators/configmaps/istio-operator-lock": context deadline exceeded
2025-06-03T02:52:32.845603962Z I0603 02:52:32.845595 1 leaderelection.go:277] failed to renew lease openshift-servicemesh-operators/istio-operator-lock: timed out waiting for the condition
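For context: with the usual client-go/controller-runtime defaults (roughly a 15s lease duration and a 10s renew deadline), a leader that cannot complete the lease update within the renew deadline deliberately exits with "leader election lost" rather than risk two active leaders, so the restarts point at slow API round-trips rather than a bug in the operators themselves. A sketch for inspecting the lock objects on the live cluster (lock names taken from the logs above):

# Lease-based lock used by the gitops operator; compare renewTime updates
# against the ~10s renew deadline
$ oc get lease 2b63967d.openshift.io -n openshift-gitops-operator -o yaml

# The istio operator in this version still uses a ConfigMap-based lock
$ oc get configmap istio-operator-lock -n openshift-servicemesh-operators -o yaml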
The errors mentioned in KCS https://access.redhat.com/solutions/7008349 are also present in this cluster:
$ omc logs kube-controller-manager-map-prod-noe-1-aro-tq44g-master-0 -n openshift-kube-controller-manager -c kube-controller-manager | grep -i 'error retrieving'
2025-06-03T06:09:07.191223781Z E0603 06:09:07.191174 1 leaderelection.go:332] error retrieving resource lock kube-system/kube-controller-manager: Get "https://api-int.noe-1.prod.map.internal.tech-05.net:6443/apis/coordination.k8s.io/v1/namespaces/kube-system/leases/kube-controller-manager?timeout=6s": net/http: request canceled (Client.Timeout exceeded while awaiting headers)
2025-06-03T07:19:27.503904194Z E0603 07:19:27.503868 1 leaderelection.go:332] error retrieving resource lock kube-system/kube-controller-manager: Get "https://api-int.noe-1.prod.map.internal.tech-05.net:6443/apis/coordination.k8s.io/v1/namespaces/kube-system/leases/kube-controller-manager?timeout=6s": net/http: request canceled (Client.Timeout exceeded while awaiting headers)
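Since kube-controller-manager hits the same client-side timeouts against api-int, it may be worth correlating timestamps across the other control-plane components; a sketch (pod names are placeholders):

# Do kube-scheduler and etcd log timeouts/slowness in the same windows?
$ omc logs -n openshift-kube-scheduler <kube-scheduler-pod> -c kube-scheduler \
    | grep -iE 'error retrieving resource lock|failed to renew lease'
$ omc logs -n openshift-etcd <etcd-pod> -c etcd \
    | grep -iE 'took too long|slow fdatasync|overloaded network'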
Asked the customer to run the recommended DNS tests from the mentioned KCS (https://access.redhat.com/solutions/7008349), but the DNS response times are within normal values, for example:
sh-5.1# curl -ks -w "\nTOTAL TIME: %{time_total}s DNS TIME: %{time_namelookup}\n" https://api-int.noe-1.prod.map.internal.tech-05.net:6443
{
  "kind": "Status",
  "apiVersion": "v1",
  "metadata": {},
  "status": "Failure",
  "message": "forbidden: User \"system:anonymous\" cannot get path \"/\"",
  "reason": "Forbidden",
  "details": {},
  "code": 403
}
TOTAL TIME: 0.024861s DNS TIME: 0.000300
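A single probe within normal values does not rule out transient spikes around the moments the leases fail to renew; a sketch for running the same test in a loop from a node debug shell and keeping the timestamps (duration and output file are arbitrary):

# One probe per second for five minutes, timestamped
for i in $(seq 1 300); do
  curl -ks -o /dev/null \
    -w "$(date -u +%FT%TZ) TOTAL: %{time_total}s DNS: %{time_namelookup}s\n" \
    https://api-int.noe-1.prod.map.internal.tech-05.net:6443
  sleep 1
done | tee /tmp/api-int-latency.log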
Version-Release number of selected component (if applicable):
4.16.37
How reproducible:
Currently present
Expected Results:
To determine why all the pods that use the leader election mechanism are constantly crashing, and what could be done to resolve the issue.
Additional information:
This issue is present on an ARO cluster. We created an internal ticket for ARO SRE to review, but they suggested reaching out to etcd experts. One of the conclusions from the etcd and OpenShift SMEs was to increase the ring buffer size on the node interfaces; however, ARO SRE informed us that changing this value on the control plane as laid out in the mentioned KCS [1] is untested and not common on their managed platform. We've checked, and the current ring buffer size appears to be the default across the different Azure VM family SKUs (v3, v5, etc.).
[1] https://access.redhat.com/solutions/5637801
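For reference, the checks behind that suggestion would look roughly like the following on a control-plane node (the interface name eth0 is an assumption; on Azure VMs with accelerated networking the synthetic NIC is usually eth0 with a paired enP* VF):

# Current vs. maximum ring buffer sizes
ethtool -g eth0

# Steadily increasing rx drops/discards would support the ring-buffer theory
ethtool -S eth0 | grep -iE 'drop|discard'

# The change proposed by the SMEs per KCS [1]; untested on ARO according to SRE
# ethtool -G eth0 rx 4096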