Bug
Resolution: Won't Do
Undefined
None
4.10.z
Important
None
Rejected
False
Description of problem:
The issue presents as multus failures with timeouts when adding or deleting containers on build. However, the larger issue is that kube-apiserver pods are timing out on queries and returning <nil> responses to GET/PUT requests, leading to timeouts and failed responses. Not all kube-apiserver pods are experiencing this problem, however.
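For context, a minimal sketch of how we narrow down which apiserver instance is affected, assuming the usual app=openshift-kube-apiserver label on these static pods; the 'post-timeout activity' string comes from the log sampling in the steps below:
~~~
# List the kube-apiserver pods and the master nodes they run on.
oc get pods -n openshift-kube-apiserver -l app=openshift-kube-apiserver -o wide

# Count timeout markers per pod to see which master is returning <nil>/timeouts.
for pod in $(oc get pods -n openshift-kube-apiserver -l app=openshift-kube-apiserver -o name); do
  echo "== $pod =="
  oc logs -n openshift-kube-apiserver "$pod" -c kube-apiserver --since=1h | grep -c 'post-timeout activity'
done
~~~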
Version-Release number of selected component (if applicable):
OCP 4.10.40
How reproducible:
Constantly/every time.
Steps to Reproduce:
1. Delete the kube-apiserver pods and kube-apiserver-operator pods, then wait for recreation.
2. Observe that all pods return to 5/5 READY status, but the kube-apiserver container on one master node logs the following error messaging (sampling):
~~~
2022-12-13T19:22:29.390721236Z W1213 19:22:29.390208 17 clientconn.go:1331] [core] grpc: addrConn.createTransport failed to connect to {xxx.xxx.xxx.22:2379 xxx.xxx.xxx.22 <nil> 0 <nil>}. Err: connection error: desc = "transport: Error while dialing dial tcp xxx.xxx.xxx.22:2379: connect: connection refused". Reconnecting...
2022-12-13T19:22:29.390721236Z W1213 19:22:29.390285 17 clientconn.go:1331] [core] grpc: addrConn.createTransport failed to connect to {xxx.xxx.xxx.22:2379 xxx.xxx.xxx.22 <nil> 0 <nil>}. Err: connection error: desc = "transport: Error while dialing dial tcp xxx.xxx.xxx.22:2379: connect: connection refused". Reconnecting...
2022-12-13T19:22:29.390721236Z W1213 19:22:29.390335 17 clientconn.go:1331] [core] grpc: addrConn.createTransport failed to connect to {xxx.xxx.xxx.22:2379 xxx.xxx.xxx.22 <nil> 0 <nil>}. Err: connection error: desc = "transport: Error while dialing dial tcp xxx.xxx.xxx.22:2379: connect: connection refused". Reconnecting...
...
2022-12-13T19:24:40.997279761Z E1213 19:24:40.997097 17 wrap.go:54] timeout or abort while handling: method=GET URI="/apis/build.openshift.io/v1/namespaces/<obfuscated>/buildconfigs" audit-ID="6e33122a-6454-466a-969c-e6c9a7c4d9f8"
2022-12-13T19:24:40.997279761Z E1213 19:24:40.997167 17 timeout.go:141] post-timeout activity - time-elapsed: 3.774µs, GET "/apis/build.openshift.io/v1/namespaces/<obfuscated>/buildconfigs" result: <nil>
## note: <nil> responses fill most of the log sampling as the apiserver catches up with all the requests that could not be completed and were returned with an empty response/no value.
...
2022-12-16T15:26:02.253128797Z I1216 15:26:02.253056 17 controller.go:129] OpenAPI AggregationController: action for item v1beta1.metrics.k8s.io: Rate Limited Requeue.
...
2022-12-16T15:27:32.219905116Z E1216 15:27:32.219800 17 controller.go:116] loading OpenAPI spec for "v1beta1.metrics.k8s.io" failed with: failed to retrieve openAPI spec, http error: ResponseCode: 500, Body: Internal Server Error: "/openapi/v2": Post "https://172.30.0.1:443/apis/authorization.k8s.io/v1/subjectaccessreviews?timeout=10s": net/http: request canceled (Client.Timeout exceeded while awaiting headers)
~~~
3. Observe that new pods scaling in different namespaces/hosts fail with the multus creation error logged below:
~~~
1h55m Warning FailedCreatePodSandBox pod/kube-apiserver-guard-<node> Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_kube-apiserver-guard-<node>_openshift-kube-apiserver_404a6e71-7458-4a80-ad01-9a69b754df47_0(a700fcf2fe9db2f5bcd4b45de11fb67a582d89151377548d7017c593f6ba2c3a): error adding pod openshift-kube-apiserver_kube-apiserver-guard-<node> to CNI network "multus-cni-network": plugin type="multus" name="multus-cni-network" failed (add): Multus: [openshift-kube-apiserver/kube-apiserver-guard-<node>/404a6e71-7458-4a80-ad01-9a69b754df47]: error getting pod: Get "https://[api-int.<cluster><domain>]:6443/api/v1/namespaces/openshift-kube-apiserver/pods/kube-apiserver-guard-<node>?timeout=1m0s": context deadline exceeded
~~~
Note that the multus error above is not limited to this namespace: it appears in every namespace that talks to this master node while trying to create a pod, and the sample pod names change each time.

Also worth noting: all nodes/pods can reach the api-int address, which is hosted on an external load balancer forwarding round-robin to each master node. The LB strategy was changed from 'source' to 'roundrobin' to distribute traffic, which seems to have lessened the severity of the issue somewhat; pods now see the multus alert but eventually do get created. However, one master node is still failing to make these calls, so whenever the LB distributes a request to that failing host, the processes time out/fail until the client retries against the next available apiserver.

We have also increased the CPU budget on the masters from 4 cores to 6 cores; memory is still 16 GB per master node (we can increase if needed). This is a VMware/UPI deployment with an external LB running haproxy.

etcd appears healthy: response times are good, all nodes are in sync on raft index, the index is only about 700 MB, and etcdctl health queries average 12 ms response time.

We saw in the logs a stack dump from the previous kube-apiserver container indicating a runtime crash (attached in the next update). Prior to this crash, the logs showed the same pattern as the sampling above: <nil> responses and timeouts.
~~~
fatal error: found pointer to free object

runtime stack:
runtime.throw({0x5376f75, 0xc04e89ca00})
        /usr/lib/golang/src/runtime/panic.go:1198 +0x71
runtime.(*mspan).reportZombies(0x7fad78cb9700)
~~~
We cleared two three-year-old webhooks:
validatingwebhookconfiguration.admissionregistration.k8s.io/autoscaling.openshift.io
validatingwebhookconfiguration.admissionregistration.k8s.io/multus.openshift.io
which, upon restart, removed the error messaging about the failed x509 certificate from the openshift-kube-apiserver-operator log:
~~~
2022-12-14T16:54:46.126003323Z E1214 16:54:46.125968 1 degraded_webhook.go:128] x509: certificate signed by unknown authority
2022-12-14T16:54:47.137528106Z E1214 16:54:47.137442 1 degraded_webhook.go:128] x509: certificate signed by unknown authority
2022-12-14T16:54:49.150220726Z E1214 16:54:49.150158 1 degraded_webhook.go:56] machineautoscalers.autoscaling.openshift.io: x509: certificate signed by unknown authority
2022-12-14T16:54:43.092168440Z E1214 16:54:43.092114 1 degraded_webhook.go:128] x509: certificate signed by unknown authority
2022-12-14T16:54:44.103769464Z E1214 16:54:44.103723 1 degraded_webhook.go:128] x509: certificate signed by unknown authority
2022-12-14T16:54:46.116408902Z E1214 16:54:46.116133 1 degraded_webhook.go:56] clusterautoscalers.autoscaling.openshift.io: x509: certificate signed by unknown authority
~~~
These error messages in the kube-apiserver-operator pod are now resolved, but the kube-apiserver pods (restarted fresh afterwards, along with the operator pod) still show the timeouts. We need assistance in debugging the possible cause. The master nodes are separated across different VHOSTS and do not appear to be massively overutilized based on CPU load averages. (A sketch of the etcd and webhook checks referenced above follows these steps.)
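For reference, a minimal sketch of the etcd-health and webhook checks described above, assuming the standard OpenShift 4 layout (openshift-etcd namespace, an etcdctl container in the etcd pods, and the webhook object names quoted in this report); exact pod names will differ:
~~~
# Check etcd member health/latency from an etcd pod on a master node.
ETCD_POD=$(oc get pods -n openshift-etcd -l app=etcd -o name | head -n1)
oc exec -n openshift-etcd "$ETCD_POD" -c etcdctl -- etcdctl endpoint health -w table
oc exec -n openshift-etcd "$ETCD_POD" -c etcdctl -- etcdctl endpoint status -w table

# Inspect the validating webhooks for stale entries (dead services / old CA bundles).
oc get validatingwebhookconfiguration
oc get validatingwebhookconfiguration multus.openshift.io -o yaml | grep -A3 clientConfig

# Remove the stale webhooks once confirmed obsolete (this is the action taken above).
oc delete validatingwebhookconfiguration autoscaling.openshift.io multus.openshift.io
~~~
These are the checks behind the "12 ms average / ~700 MB index / raft in sync" statements above.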
Actual results:
Details are captured above.
Expected results:
The kube-apiserver containers should be able to query etcd consistently and should not experience timeouts on calls or client disconnects.
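If it helps verification, a rough sketch of how the expected behavior could be confirmed from the apiserver's own metrics; the metric names are the upstream Kubernetes ones, and the instance sampled depends on which apiserver the round-robin LB routes the request to:
~~~
# Sample the apiserver-to-etcd latency histogram; repeat a few times to hit
# different apiservers behind the round-robin LB.
oc get --raw /metrics | grep '^etcd_request_duration_seconds_count'

# Watch for requests the apiserver had to terminate (timeouts/aborts).
oc get --raw /metrics | grep '^apiserver_request_terminations_total'
~~~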
Additional info:
We also observed at least once that a kube-apiserver container experienced a runtime crash; the dump is attached to the case below, captured after a significant timeout. (Note that 25 hours were logged as part of this core-out.)
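As a note, a minimal sketch of one way to pull the previous container's crash log and a must-gather for the case; the pod name placeholder follows the <node> convention used above:
~~~
# Pull the log of the crashed (previous) kube-apiserver container, which
# contains the "fatal error: found pointer to free object" stack dump.
oc logs --previous -n openshift-kube-apiserver kube-apiserver-<node> -c kube-apiserver > kube-apiserver-crash.log

# Collect a full must-gather for attachment to the case.
oc adm must-gather --dest-dir=./must-gather
~~~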