Bug
Resolution: Unresolved
Major
None
4.13, 4.12, 4.14
None
Important
No
False
Description of problem:
A validating webhook whose timeout exceeds the hard-coded kube-controller-manager client timeout of 5 seconds leaves the kube-controller-manager pods in a CrashLoopBackOff state in a continuous leader-election loop, stalling the cluster completely, as described in:
kube-controller-manager timeout is exceeded by validating webhook during CNI restart leading to degraded cluster state
Version-Release number of selected component (if applicable):
How reproducible:
Add a validating webhook with a timeout longer than 5 seconds AND have it fail (a minimal sketch follows the reproduction steps below).
Steps to Reproduce:
1. Create a validating webhook that matches ConfigMap updates with a timeout longer than 5 seconds.
2. Point the webhook at a backend that hangs or is unreachable, so admission calls on the leader-election ConfigMap stall or fail.
3. Observe the kube-controller-manager pods failing leader election and entering CrashLoopBackOff.
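The following is a minimal sketch of the kind of webhook configuration that can trigger the problem, registered with client-go. It matches ConfigMap updates, sets a 30-second timeout (longer than the 5-second client timeout seen in the logs), uses failurePolicy Fail, and points at a backend that is expected to hang or be unreachable. All names here (slow-webhook-repro, slow.webhook.example.com, nonexistent-webhook-svc) are hypothetical; the rules and scope in the original incident may differ:
~~~go
package main

import (
	"context"

	admissionregistrationv1 "k8s.io/api/admissionregistration/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	timeout := int32(30)                                    // longer than the 5s kube-controller-manager client timeout
	failurePolicy := admissionregistrationv1.Fail           // a failed or slow call rejects the request
	sideEffects := admissionregistrationv1.SideEffectClassNone

	webhook := &admissionregistrationv1.ValidatingWebhookConfiguration{
		ObjectMeta: metav1.ObjectMeta{Name: "slow-webhook-repro"}, // hypothetical name
		Webhooks: []admissionregistrationv1.ValidatingWebhook{{
			Name:                    "slow.webhook.example.com", // hypothetical
			AdmissionReviewVersions: []string{"v1"},
			SideEffects:             &sideEffects,
			TimeoutSeconds:          &timeout,
			FailurePolicy:           &failurePolicy,
			ClientConfig: admissionregistrationv1.WebhookClientConfig{
				Service: &admissionregistrationv1.ServiceReference{
					// Backend that hangs or does not exist, so admission calls stall.
					Namespace: "default",
					Name:      "nonexistent-webhook-svc",
				},
			},
			Rules: []admissionregistrationv1.RuleWithOperations{{
				Operations: []admissionregistrationv1.OperationType{admissionregistrationv1.Update},
				Rule: admissionregistrationv1.Rule{
					APIGroups:   []string{""},
					APIVersions: []string{"v1"},
					Resources:   []string{"configmaps"},
				},
			}},
		}},
	}

	_, err = client.AdmissionregistrationV1().ValidatingWebhookConfigurations().
		Create(context.TODO(), webhook, metav1.CreateOptions{})
	if err != nil {
		panic(err)
	}
}
~~~
With such a webhook in place, every update to the leader-election ConfigMap in kube-system has to wait on the webhook, which is exactly the request that times out in the log excerpt below.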
Actual results:
Cluster is stalled:
- kube-controller-manager pods are in CrashLoopBackOff, continuously failing leader election.
- Pods are deleted but are not re-created automatically by the operator or DaemonSet.
- openshift-apiserver pods are crash-looping, but openshift-kube-apiserver pods are Running/available.
- The API appears to stall on requests that create new resources, while deleting resources completes successfully and immediately.
- etcd appears healthy and is not in a read-only state.
- Master nodes are Ready, and api/api-int is consistently reachable from both the bastion and the master nodes (the API is not flapping).
Expected results:
The cluster should not stall when a validating webhook is slow or failing.
Additional info:
The kube-controller-manager pod logs show the following message repeatedly:
~~~
2023-12-12T14:29:59.457408575Z E1212 14:29:59.457354 1 leaderelection.go:367] Failed to update lock: Put "https://api-int.example.com:6443/api/v1/namespaces/kube-system/configmaps/kube-controller-manager?timeout=5s": net/http: request canceled (Client.Timeout exceeded while awaiting headers)
~~~
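For context, this crash-looping behavior is consistent with how client-go leader election is typically wired: when the lock can no longer be renewed (here, because the ConfigMap update is held up by the webhook and the 5-second client timeout expires), the OnStoppedLeading callback fires and the process exits, so the kubelet restarts the static pod. A minimal sketch of that pattern, assuming a Lease-based lock, illustrative timing values, and hypothetical identities (the actual kube-controller-manager wiring and lock type may differ):
~~~go
package main

import (
	"context"
	"os"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
	"k8s.io/client-go/tools/leaderelection"
	"k8s.io/client-go/tools/leaderelection/resourcelock"
)

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	// Hypothetical Lease-based lock; the log above shows a ConfigMap-based lock instead.
	lock := &resourcelock.LeaseLock{
		LeaseMeta:  metav1.ObjectMeta{Name: "kube-controller-manager", Namespace: "kube-system"},
		Client:     client.CoordinationV1(),
		LockConfig: resourcelock.ResourceLockConfig{Identity: os.Getenv("POD_NAME")},
	}

	leaderelection.RunOrDie(context.Background(), leaderelection.LeaderElectionConfig{
		Lock:          lock,
		LeaseDuration: 15 * time.Second, // illustrative values, not the exact kube-controller-manager settings
		RenewDeadline: 10 * time.Second,
		RetryPeriod:   2 * time.Second,
		Callbacks: leaderelection.LeaderCallbacks{
			OnStartedLeading: func(ctx context.Context) {
				// Run the controllers while holding the lock.
				<-ctx.Done()
			},
			OnStoppedLeading: func() {
				// If every lock update fails or times out (as in the log above),
				// leadership is lost and the process exits; the kubelet then
				// restarts the pod, which surfaces as CrashLoopBackOff.
				os.Exit(1)
			},
		},
	})
}
~~~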