Description of problem:
All pods that use the leader election mechanism are constantly crashing, all of them with the error "leader election lost".
For example, the following pods that use leader election show multiple restarts:
$ omc get pods -n openshift-gitops-operator
NAME                                                            READY   STATUS    RESTARTS   AGE
openshift-gitops-operator-controller-manager-6747fcbcbd-4gd7x   2/2     Running   124        37d
----
$ omc get pods -n openshift-servicemesh-operators
NAME                              READY   STATUS    RESTARTS   AGE
...
istio-operator-85fbbdbf6c-ctgrm   1/1     Running   126        37d
----
$ omc get pods -n openshift-tempo-operator
NAME                                       READY   STATUS    RESTARTS   AGE
tempo-operator-controller-87c5b548-cmmcn   2/2     Running   119        37d
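To confirm the scope, a quick sketch for listing every pod with an unusually high restart count across the must-gather (assuming omc accepts the same -A flag as oc; the threshold of 50 and the column positions are assumptions based on the output above):

# Column 5 is RESTARTS once -A prepends the NAMESPACE column
$ omc get pods -A | awk 'NR > 1 && $5+0 > 50 {print $1, $2, $5}'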
There could be other affected pods, but essentially every pod that uses leader election is crashing with the same "leader election lost" error, for example:
$ omc logs -n openshift-gitops-operator openshift-gitops-operator-controller-manager-6747fcbcbd-4gd7x -c manager --previous
2025-06-03T02:52:34.238152004Z E0603 02:52:34.238095 1 leaderelection.go:369] Failed to update lock: Put "https://172.23.0.1:443/apis/coordination.k8s.io/v1/namespaces/openshift-gitops-operator/leases/2b63967d.openshift.io": context deadline exceeded
2025-06-03T02:52:34.238152004Z I0603 02:52:34.238131 1 leaderelection.go:285] failed to renew lease openshift-gitops-operator/2b63967d.openshift.io: timed out waiting for the condition
2025-06-03T02:52:34.238217143Z 2025-06-03T02:52:34Z ERROR setup problem running manager {"error": "leader election lost"}
2025-06-03T02:52:34.238217143Z main.main
---------
$ omc logs -n openshift-servicemesh-operators istio-operator-85fbbdbf6c-ctgrm -c istio-operator --previous
...
2025-06-03T02:52:32.845603962Z E0603 02:52:32.845505 1 leaderelection.go:356] Failed to update lock: Put "https://172.23.0.1:443/api/v1/namespaces/openshift-servicemesh-operators/configmaps/istio-operator-lock": context deadline exceeded
2025-06-03T02:52:32.845603962Z I0603 02:52:32.845595 1 leaderelection.go:277] failed to renew lease openshift-servicemesh-operators/istio-operator-lock: timed out waiting for the condition
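For context: with the usual client-go/controller-runtime defaults (roughly a 15s lease duration and a 10s renew deadline), a leader that cannot complete the lease update within the renew deadline deliberately exits with "leader election lost" rather than risk two active leaders, so the restarts point at slow API round-trips rather than a bug in the operators themselves. A sketch for inspecting the lock objects on the live cluster (lock names taken from the logs above):

# Lease-based lock used by the gitops operator; compare renewTime updates
# against the ~10s renew deadline
$ oc get lease 2b63967d.openshift.io -n openshift-gitops-operator -o yaml

# The istio operator in this version still uses a ConfigMap-based lock
$ oc get configmap istio-operator-lock -n openshift-servicemesh-operators -o yaml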
The errors mentioned in KCS https://access.redhat.com/solutions/7008349 are also present in this cluster:
$ omc logs kube-controller-manager-map-prod-noe-1-aro-tq44g-master-0 -n openshift-kube-controller-manager -c kube-controller-manager | grep -i 'error retrieving'
2025-06-03T06:09:07.191223781Z E0603 06:09:07.191174 1 leaderelection.go:332] error retrieving resource lock kube-system/kube-controller-manager: Get "https://api-int.noe-1.prod.map.internal.tech-05.net:6443/apis/coordination.k8s.io/v1/namespaces/kube-system/leases/kube-controller-manager?timeout=6s": net/http: request canceled (Client.Timeout exceeded while awaiting headers)
2025-06-03T07:19:27.503904194Z E0603 07:19:27.503868 1 leaderelection.go:332] error retrieving resource lock kube-system/kube-controller-manager: Get "https://api-int.noe-1.prod.map.internal.tech-05.net:6443/apis/coordination.k8s.io/v1/namespaces/kube-system/leases/kube-controller-manager?timeout=6s": net/http: request canceled (Client.Timeout exceeded while awaiting headers)
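Since kube-controller-manager hits the same client-side timeouts against api-int, it may be worth correlating timestamps across the other control-plane components; a sketch (pod names are placeholders):

# Do kube-scheduler and etcd log timeouts/slowness in the same windows?
$ omc logs -n openshift-kube-scheduler <kube-scheduler-pod> -c kube-scheduler \
    | grep -iE 'error retrieving resource lock|failed to renew lease'
$ omc logs -n openshift-etcd <etcd-pod> -c etcd \
    | grep -iE 'took too long|slow fdatasync|overloaded network'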
Asked the customer to run the recommended DNS tests from the mentioned KCS (https://access.redhat.com/solutions/7008349), but the DNS response times are within normal values, for example:
sh-5.1# curl -ks -w "\nTOTAL TIME: %{time_total}s DNS TIME: %{time_namelookup}\n" https://api-int.noe-1.prod.map.internal.tech-05.net:6443
{
  "kind": "Status",
  "apiVersion": "v1",
  "metadata": {},
  "status": "Failure",
  "message": "forbidden: User \"system:anonymous\" cannot get path \"/\"",
  "reason": "Forbidden",
  "details": {},
  "code": 403
}
TOTAL TIME: 0.024861s DNS TIME: 0.000300
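A single probe within normal values does not rule out transient spikes around the moments the leases fail to renew; a sketch for running the same test in a loop from a node debug shell and keeping the timestamps (duration and output file are arbitrary):

# One probe per second for five minutes, timestamped
for i in $(seq 1 300); do
  curl -ks -o /dev/null \
    -w "$(date -u +%FT%TZ) TOTAL: %{time_total}s DNS: %{time_namelookup}s\n" \
    https://api-int.noe-1.prod.map.internal.tech-05.net:6443
  sleep 1
done | tee /tmp/api-int-latency.log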
Version-Release number of selected component (if applicable):
4.16.37
How reproducible:
Currently present
Expected Results:
To determine why all the pods that use the leader election mechanism are constantly crashing, and what could be done to resolve the issue.
Additional information:
This issue is present on an ARO cluster. We created an internal ticket for ARO SRE to review, but they suggested reaching out to etcd experts. One of the conclusions from the etcd and OpenShift SMEs was to increase the ring buffer size on the node interfaces; however, ARO SRE informed us that changing this value on the control plane as laid out in the mentioned KCS [1] is untested and not common on their managed platform. We've checked, and the current ring buffer size appears to be the default across the different Azure VM family SKUs (v3, v5, etc.).
[1] https://access.redhat.com/solutions/5637801
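For reference, the checks behind that suggestion would look roughly like the following on a control-plane node (the interface name eth0 is an assumption; on Azure VMs with accelerated networking the synthetic NIC is usually eth0 with a paired enP* VF):

# Current vs. maximum ring buffer sizes
ethtool -g eth0

# Steadily increasing rx drops/discards would support the ring-buffer theory
ethtool -S eth0 | grep -iE 'drop|discard'

# The change proposed by the SMEs per KCS [1]; untested on ARO according to SRE
# ethtool -G eth0 rx 4096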