  1. OpenShift Bugs
  2. OCPBUGS-25879

kube-controller-manager timeout exceeded by validating webhook timeout leading to degraded cluster state


Details

    • Bug
    • Resolution: Unresolved
    • Major
    • None
    • 4.13, 4.12, 4.14
    • None
    • Important
    • No
    • False

    Description

      Description of problem:

      When a validating webhook's timeout exceeds the hard-coded kube-controller-manager timeout of 5 seconds, the kube-controller-manager pods enter CrashLoopBackOff in a continuous leader-election loop, stalling the cluster completely, as described in:
      kube-controller-manager timeout is exceeded by validating webhook during CNI restart leading to degraded cluster state
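
The timing relationship described here can be sketched as a small model. This is an illustration only, not Kubernetes code: the function name and its parameters are hypothetical, and it assumes that the webhook matches the lock update request and that an unresponsive webhook makes the API server wait for the full webhook timeout before it answers.

```python
# Simplified timing model of the failure described in this report.
# All names are hypothetical; this is not Kubernetes or client-go code.

KCM_CLIENT_TIMEOUT_S = 5.0  # hard-coded kube-controller-manager timeout cited in this report


def client_times_out(webhook_timeout_s: float,
                     webhook_responds: bool,
                     client_timeout_s: float = KCM_CLIENT_TIMEOUT_S) -> bool:
    """Return True if the client gives up before the API server can reply.

    Assumption: when the webhook does not respond, the API server holds the
    request until the webhook timeout expires.
    """
    if webhook_responds:
        # A healthy webhook is assumed to answer well within the client timeout.
        return False
    # The API server replies only after the webhook call has timed out, so a
    # webhook timeout longer than the client timeout means the client cancels first.
    return webhook_timeout_s > client_timeout_s


print(client_times_out(webhook_timeout_s=10, webhook_responds=False))
print(client_times_out(webhook_timeout_s=3, webhook_responds=False))
```

Under these assumptions, only the combination of a failing webhook and a webhook timeout above 5 seconds causes the lock update to be cancelled on the client side, which matches the condition given under "How reproducible".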

      Version-Release number of selected component (if applicable):

          

      How reproducible:

      Add a validating webhook with a timeout longer than 5 seconds and make the webhook fail, so that calls to it time out.
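
To check whether a cluster has webhooks that meet this condition, the output of `oc get validatingwebhookconfigurations -o json` can be scanned for `timeoutSeconds` values above 5. The following is a minimal sketch: the function name and the sample data are hypothetical, and the fallback of 10 seconds is the documented `admissionregistration.k8s.io/v1` default for an unset `timeoutSeconds`.

```python
import json

KCM_CLIENT_TIMEOUT_S = 5        # timeout cited in this report
DEFAULT_WEBHOOK_TIMEOUT_S = 10  # admissionregistration.k8s.io/v1 default when unset


def webhooks_over_limit(config_list: dict, limit_s: int = KCM_CLIENT_TIMEOUT_S) -> list:
    """Return (configuration name, webhook name, timeout) for each webhook
    whose timeoutSeconds exceeds limit_s."""
    offenders = []
    for item in config_list.get("items", []):
        cfg_name = item.get("metadata", {}).get("name", "")
        for hook in item.get("webhooks") or []:
            timeout = hook.get("timeoutSeconds", DEFAULT_WEBHOOK_TIMEOUT_S)
            if timeout > limit_s:
                offenders.append((cfg_name, hook.get("name", ""), timeout))
    return offenders


# Hypothetical sample, shaped like `oc get validatingwebhookconfigurations -o json`.
sample = json.loads("""
{
  "items": [
    {"metadata": {"name": "slow-webhook-config"},
     "webhooks": [{"name": "slow.example.com", "timeoutSeconds": 30}]},
    {"metadata": {"name": "fast-webhook-config"},
     "webhooks": [{"name": "fast.example.com", "timeoutSeconds": 3}]}
  ]
}
""")

offenders = webhooks_over_limit(sample)
for cfg_name, hook_name, timeout in offenders:
    print(f"{cfg_name}/{hook_name}: timeoutSeconds={timeout}")
```

A webhook listed by this check is only a candidate: it affects the cluster as described here only if it also matches the relevant requests and is failing.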

      Steps to Reproduce:

          1.
          2.
          3.
          

      Actual results:

      Cluster is stalled: kube-controller-manager pods are in CrashLoopBackOff, continuously failing leader election.
      - Pods are deleted but are not being re-created automatically by the operator or daemonset.
      - openshift-apiserver pods are crash-looping, but openshift-kube-apiserver pods are in RUNNING/available state.
      - The API appears to be stalling out on requests to create new resources but deleting resources can be completed successfully immediately.
      - etcd appears healthy and is not in read-only state.
      - Master nodes are in READY and API/API-INT is reachable from both bastion and master nodes consistently (API not flapping).
        

      Expected results:

      Cluster shouldn't fail

      Additional info:

      The kube-controller-manager pod logs show the following message repeatedly:
      ~~~
      2023-12-12T14:29:59.457408575Z E1212 14:29:59.457354       1 leaderelection.go:367] Failed to update lock: Put "https://api-int.example.com:6443/api/v1/namespaces/kube-system/configmaps/kube-controller-manager?timeout=5s": net/http: request canceled (Client.Timeout exceeded while awaiting headers)
      ~~~
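
For triage, the log line above can be matched and split into its parts to confirm that the failing request is the leader-election lock update with a 5-second timeout. This is a minimal sketch using only the Python standard library; the function name and the regular expression are illustrative and are written against the exact message format quoted in this report.

```python
import re
from urllib.parse import urlsplit, parse_qs

# Pattern written against the leader-election error quoted in this report.
LOCK_ERROR = re.compile(r'Failed to update lock: Put "(?P<url>[^"]+)": (?P<error>.+)$')


def parse_lock_failure(line: str):
    """Return the request path, timeout parameter and error text, or None if
    the line is not a lock update failure."""
    match = LOCK_ERROR.search(line)
    if match is None:
        return None
    url = urlsplit(match.group("url"))
    # The client timeout is carried as the `timeout` query parameter.
    timeout = parse_qs(url.query).get("timeout", [None])[0]
    return {"path": url.path, "timeout": timeout, "error": match.group("error")}


line = (
    '2023-12-12T14:29:59.457408575Z E1212 14:29:59.457354       1 leaderelection.go:367] '
    'Failed to update lock: Put "https://api-int.example.com:6443/api/v1/namespaces/'
    'kube-system/configmaps/kube-controller-manager?timeout=5s": '
    'net/http: request canceled (Client.Timeout exceeded while awaiting headers)'
)

result = parse_lock_failure(line)
print(result)
```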

          People

            fkrepins@redhat.com Filip Krepinsky
            rhn-support-igreen Ilan Green
            ying zhou ying zhou
            Votes: 3
            Watchers: 8
