OpenShift Bugs / OCPBUGS-57662

Pods that use the leader election mechanism are constantly crashing, all of them with error: leader election lost


    • Type: Bug
    • Resolution: Not a Bug
    • Priority: Major
    • Affects Version/s: 4.16.z
    • Component/s: Etcd
    • Quality / Stability / Reliability

      Description of problem:

          All pods that use the leader election mechanism are constantly crashing, all with the error: leader election lost.

      For example, these pods that use the leader election mechanism are experiencing multiple restarts:

      $ omc get pods -n openshift-gitops-operator
      NAME                                                            READY   STATUS    RESTARTS   AGE
      openshift-gitops-operator-controller-manager-6747fcbcbd-4gd7x   2/2     Running   124        37d
      ----
      $ omc get pods -n openshift-servicemesh-operators
      NAME                              READY   STATUS    RESTARTS   AGE
      ...
      istio-operator-85fbbdbf6c-ctgrm   1/1     Running   126        37d
      ----
      $ omc get pods -n openshift-tempo-operator
      NAME                                       READY   STATUS    RESTARTS   AGE
      tempo-operator-controller-87c5b548-cmmcn   2/2     Running   119        37d

      There could be other affected pods, but essentially all pods that use the leader election mechanism are constantly crashing with the error leader election lost, for example:

      $ omc logs -n openshift-gitops-operator openshift-gitops-operator-controller-manager-6747fcbcbd-4gd7x -c manager --previous
      2025-06-03T02:52:34.238152004Z E0603 02:52:34.238095       1 leaderelection.go:369] Failed to update lock: Put "https://172.23.0.1:443/apis/coordination.k8s.io/v1/namespaces/openshift-gitops-operator/leases/2b63967d.openshift.io": context deadline exceeded
      2025-06-03T02:52:34.238152004Z I0603 02:52:34.238131       1 leaderelection.go:285] failed to renew lease openshift-gitops-operator/2b63967d.openshift.io: timed out waiting for the condition
      2025-06-03T02:52:34.238217143Z 2025-06-03T02:52:34Z    ERROR    setup    problem running manager    {"error": "leader election lost"}
      2025-06-03T02:52:34.238217143Z main.main
      ---------
      
      $ omc logs -n openshift-servicemesh-operators istio-operator-85fbbdbf6c-ctgrm -c istio-operator --previous
      ...
      2025-06-03T02:52:32.845603962Z E0603 02:52:32.845505       1 leaderelection.go:356] Failed to update lock: Put "https://172.23.0.1:443/api/v1/namespaces/openshift-servicemesh-operators/configmaps/istio-operator-lock": context deadline exceeded
      2025-06-03T02:52:32.845603962Z I0603 02:52:32.845595       1 leaderelection.go:277] failed to renew lease openshift-servicemesh-operators/istio-operator-lock: timed out waiting for the condition
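
      For context, these operators rely on the standard client-go leader election used by controller-runtime: the leader must keep updating its Lease before RenewDeadline expires, and when the update cannot complete in time the library stops leading and the manager exits on purpose, which is exactly the "leader election lost" restart loop shown above. Below is a minimal sketch of that mechanism, assuming the namespace/lease name from the gitops logs and the usual controller-runtime timing defaults (15s/10s/2s); it is an illustration, not the operators' actual code.

      package main

      import (
          "context"
          "os"
          "time"

          metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
          "k8s.io/client-go/kubernetes"
          "k8s.io/client-go/rest"
          "k8s.io/client-go/tools/leaderelection"
          "k8s.io/client-go/tools/leaderelection/resourcelock"
          "k8s.io/klog/v2"
      )

      func main() {
          cfg, err := rest.InClusterConfig()
          if err != nil {
              klog.Fatal(err)
          }
          client := kubernetes.NewForConfigOrDie(cfg)

          // Lease lock equivalent to the one the gitops operator is failing to renew.
          lock := &resourcelock.LeaseLock{
              LeaseMeta:  metav1.ObjectMeta{Namespace: "openshift-gitops-operator", Name: "2b63967d.openshift.io"},
              Client:     client.CoordinationV1(),
              LockConfig: resourcelock.ResourceLockConfig{Identity: os.Getenv("HOSTNAME")},
          }

          leaderelection.RunOrDie(context.Background(), leaderelection.LeaderElectionConfig{
              Lock:          lock,
              LeaseDuration: 15 * time.Second, // assumed controller-runtime defaults
              RenewDeadline: 10 * time.Second, // the Lease PUT must succeed within this window
              RetryPeriod:   2 * time.Second,
              Callbacks: leaderelection.LeaderCallbacks{
                  OnStartedLeading: func(ctx context.Context) {
                      // controllers run here while the lease is held
                      <-ctx.Done()
                  },
                  OnStoppedLeading: func() {
                      // This is the path behind "leader election lost": if renewal
                      // times out ("context deadline exceeded" on the PUT), the
                      // process exits and the pod restarts.
                      klog.Fatal("leader election lost")
                  },
              },
          })
      }

      In other words, the restarts are the intended fail-safe when the apiserver round trip for the lease update takes longer than RenewDeadline; the open question is why those round trips are slow.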

       

      The errors mentioned in this KCS, https://access.redhat.com/solutions/7008349, are present in this cluster:

      $ omc logs kube-controller-manager-map-prod-noe-1-aro-tq44g-master-0 -n openshift-kube-controller-manager -c kube-controller-manager  | grep -i 'error retrieving'
      2025-06-03T06:09:07.191223781Z E0603 06:09:07.191174       1 leaderelection.go:332] error retrieving resource lock kube-system/kube-controller-manager: Get "https://api-int.noe-1.prod.map.internal.tech-05.net:6443/apis/coordination.k8s.io/v1/namespaces/kube-system/leases/kube-controller-manager?timeout=6s": net/http: request canceled (Client.Timeout exceeded while awaiting headers)
      2025-06-03T07:19:27.503904194Z E0603 07:19:27.503868       1 leaderelection.go:332] error retrieving resource lock kube-system/kube-controller-manager: Get "https://api-int.noe-1.prod.map.internal.tech-05.net:6443/apis/coordination.k8s.io/v1/namespaces/kube-system/leases/kube-controller-manager?timeout=6s": net/http: request canceled (Client.Timeout exceeded while awaiting headers)
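
      To check whether lease renewals are lagging for more than these few operators, one option is to list the coordination.k8s.io Leases and compare each renewTime against its leaseDurationSeconds. A rough Go sketch, assuming a kubeconfig with cluster read access (illustrative only, not part of the case data):

      package main

      import (
          "context"
          "fmt"
          "time"

          metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
          "k8s.io/client-go/kubernetes"
          "k8s.io/client-go/tools/clientcmd"
      )

      func main() {
          // Assumes ~/.kube/config; inside a pod, rest.InClusterConfig() would be used instead.
          cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
          if err != nil {
              panic(err)
          }
          cs := kubernetes.NewForConfigOrDie(cfg)

          leases, err := cs.CoordinationV1().Leases(metav1.NamespaceAll).List(context.TODO(), metav1.ListOptions{})
          if err != nil {
              panic(err)
          }
          now := time.Now()
          for _, l := range leases.Items {
              if l.Spec.RenewTime == nil || l.Spec.LeaseDurationSeconds == nil {
                  continue
              }
              sinceRenew := now.Sub(l.Spec.RenewTime.Time)
              budget := time.Duration(*l.Spec.LeaseDurationSeconds) * time.Second
              // A lease whose renewTime is older than its own duration points at a
              // holder that is failing to renew, the same pattern as in the logs above.
              if sinceRenew > budget {
                  fmt.Printf("%s/%s: last renewed %s ago (leaseDurationSeconds=%d)\n",
                      l.Namespace, l.Name, sinceRenew.Round(time.Second), *l.Spec.LeaseDurationSeconds)
              }
          }
      }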
      
      
      

      We asked the customer to run the DNS tests recommended in the mentioned KCS (https://access.redhat.com/solutions/7008349), but the DNS response times are within normal values, for example:

      sh-5.1# curl -ks -w "\nTOTAL TIME: %{time_total}s DNS TIME: %{time_namelookup}\n" https://api-int.noe-1.prod.map.internal.tech-05.net:6443
      {
        "kind": "Status",
        "apiVersion": "v1",
        "metadata": {},
        "status": "Failure",
        "message": "forbidden: User \"system:anonymous\" cannot get path \"/\"",
        "reason": "Forbidden",
        "details": {},
        "code": 403
      }
      TOTAL TIME: 0.024861s DNS TIME: 0.000300
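
      Because a single curl from a debug shell can miss intermittent slowness, the same round trip can also be sampled repeatedly from inside an affected namespace with Go's net/http/httptrace, which splits each request into DNS, TCP connect and TLS phases. A hedged sketch (the api-int URL is a placeholder; certificate verification is skipped only to mirror the anonymous curl -k probe above):

      package main

      import (
          "crypto/tls"
          "fmt"
          "net/http"
          "net/http/httptrace"
          "time"
      )

      func main() {
          var start, dnsStart, connStart, tlsStart time.Time
          var dnsDur, connDur, tlsDur time.Duration

          trace := &httptrace.ClientTrace{
              DNSStart:          func(_ httptrace.DNSStartInfo) { dnsStart = time.Now() },
              DNSDone:           func(_ httptrace.DNSDoneInfo) { dnsDur = time.Since(dnsStart) },
              ConnectStart:      func(_, _ string) { connStart = time.Now() },
              ConnectDone:       func(_, _ string, _ error) { connDur = time.Since(connStart) },
              TLSHandshakeStart: func() { tlsStart = time.Now() },
              TLSHandshakeDone:  func(_ tls.ConnectionState, _ error) { tlsDur = time.Since(tlsStart) },
          }

          // Placeholder endpoint; substitute the cluster's api-int URL.
          req, err := http.NewRequest("GET", "https://api-int.example.internal:6443/readyz", nil)
          if err != nil {
              panic(err)
          }
          req = req.WithContext(httptrace.WithClientTrace(req.Context(), trace))

          client := &http.Client{
              Timeout:   10 * time.Second,
              Transport: &http.Transport{TLSClientConfig: &tls.Config{InsecureSkipVerify: true}},
          }
          start = time.Now()
          resp, err := client.Do(req)
          if err != nil {
              fmt.Println("request error:", err)
              return
          }
          resp.Body.Close()
          fmt.Printf("dns=%s connect=%s tls=%s total=%s status=%d\n",
              dnsDur, connDur, tlsDur, time.Since(start), resp.StatusCode)
      }

      Running this in a loop would show whether the latency spikes that break lease renewal are reaching the operator pods, even when one-off probes look normal.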

       

      Version-Release number of selected component (if applicable):

          4.16.37

      How reproducible:

          Currently present

      Expected Results: 

       To determine why all the pods that use the leader election mechanism are constantly crashing, and what could be done to resolve the issue. 
      
      

      Additional information: 

          This issue is present on an ARO cluster. We created an internal ticket for ARO SRE to review, but they suggested reaching out to etcd experts. One of the conclusions from the etcd and OpenShift SME teams was to increase the ring buffer size on the interfaces; however, ARO SRE informed us that changing this value for the control plane as laid out in the referenced KCS [1] is untested and not common on their managed platform. We have checked, and this appears to be the default value across the different Azure VM family SKUs (v3, v5, etc.).
      [1] https://access.redhat.com/solutions/5637801
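
      If the ring buffer hypothesis from the SMEs is pursued further, one low-risk way to check whether the control-plane NICs are actually dropping inbound packets (the usual motivation for raising RX ring sizes) is to read the per-interface drop counters from sysfs on the nodes, e.g. from a debug pod with access to the host filesystem. A minimal sketch, assuming the standard Linux /sys/class/net layout:

      package main

      import (
          "fmt"
          "os"
          "path/filepath"
          "strings"
      )

      func main() {
          ifaces, err := os.ReadDir("/sys/class/net")
          if err != nil {
              fmt.Fprintln(os.Stderr, err)
              os.Exit(1)
          }
          for _, iface := range ifaces {
              // rx_dropped / rx_fifo_errors tend to grow when inbound bursts
              // overflow the RX ring before the kernel can drain it.
              for _, counter := range []string{"rx_dropped", "rx_fifo_errors"} {
                  p := filepath.Join("/sys/class/net", iface.Name(), "statistics", counter)
                  b, err := os.ReadFile(p)
                  if err != nil {
                      continue // some virtual devices do not expose every counter
                  }
                  fmt.Printf("%s %s=%s\n", iface.Name(), counter, strings.TrimSpace(string(b)))
              }
          }
      }

      Growing values there would lend weight to the buffer-sizing theory without changing any configuration; flat counters would point attention back at the apiserver/etcd path.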

              Dean West (dwest@redhat.com)
              Omar Arias Olave (rhn-support-oariasol)
              Ge Liu