- Bug
- Resolution: Not a Bug
- Normal
- None
- 4.10.z
- Moderate
- No
- Rejected
- False
- Customer Escalated
-
Description of problem:
kube-apiserver pods causing high CPU utilization after an upgrade
Version-Release number of selected component (if applicable):
4.10.55
How reproducible:
Frequently
Steps to Reproduce:
1. Upgrade an OCP cluster from 4.10.41 to 4.10.55 (a command sketch follows these steps)
2. Upgrade completes successfully
3. Observe CPU utilization of the kube-apiserver pods on the control-plane nodes
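For reference, a minimal sketch of step 1 against a connected cluster; the exact channel and any override flags used in the customer environment are not known and are assumptions here:

# Check the current version and the available update targets
oc adm upgrade
# Start the upgrade to the affected release
oc adm upgrade --to=4.10.55
# Watch cluster operators converge on the new version
oc get clusteroperators -w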
Actual results:
kube-apiserver pods cause high CPU utilization on the node. The kube-apiserver pods were also restarting, causing instability in the cluster. In addition, the cluster was seeing egressIP flapping issues.
Expected results:
The upgrade should complete without the kube-apiserver pods causing high CPU utilization or restarts.
Additional info:
0. The customer performed the same upgrade on other clusters in the environment without issues. (They were advised to upgrade to OCP 4.10.55 because they were seeing egressIP issues on their OCP 4.10.41 clusters.)
1. The customer has three master nodes (5l, 6l, and 7l) in this cluster.
2. After the upgrade, when they noticed high CPU utilization on 6l (causing it to restart and making the cluster unstable), they migrated the VM to another ESX host, and after that it was stable. 6l is currently the etcd leader.
3. Soon after, they started seeing the same issue on 7l (including the API server pod restarting).
4. We noticed authentication failure messages in kube-rbac-proxy as well as in the API server pods (on both 7l and 5l). Restarting resolved that issue in the dns pod; the API server pods still show those errors, but they appear to be unrelated to this issue.
5. The etcd pods have had some restarts but are stable, including while the kube-apiserver pods are consuming more CPU.
6. We restarted node 7l, and after the restart it seems stable, but "oc adm top node/pod" still shows the node spiking to 100% every now and then (varying from 9% to 100%). At the same time, the kube-apiserver pod jumps from a few hundred millicores to 2 cores (see the command sketch after this list). Surprisingly, the Prometheus dashboard shows that the majority of API requests are being handled by 5l and 6l.
7. The customer is concerned that the API server pod on 7l may start restarting again and cause instability when their business opens Monday (Apr 24th) AM APAC time, so they want to understand why the CPU utilization spikes intermittently.
8. A sosreport for node 7l and pprof data for the kube-apiserver container running on 7l have been uploaded to the support case.
9. The sosreport does not contain enough information to explain the high CPU utilization on the node / kube-apiserver pod.
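For items 6 and 8 above, a minimal sketch of the commands presumably used to observe the spikes and to capture profile data; the namespace and metric names below are the standard OCP/Kubernetes ones, and the exact invocations used on the cluster are assumptions:

# Per-node and per-pod CPU usage (item 6)
oc adm top nodes
oc adm top pods -n openshift-kube-apiserver

# Per-apiserver-instance request rate, as seen on the Prometheus dashboard (item 6)
# PromQL: sum by (instance) (rate(apiserver_request_total[5m]))

# 30-second CPU profile of a kube-apiserver (item 8); note that this request goes
# through the API load balancer, so it may not land on the 7l instance specifically
oc get --raw '/debug/pprof/profile?seconds=30' > kube-apiserver-cpu.pprof
go tool pprof -top kube-apiserver-cpu.pprof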