OpenShift Bugs / OCPBUGS-12360

kube-apiserver pod causing high CPU utilization after an upgrade


    • Type: Bug
    • Resolution: Not a Bug
    • Priority: Normal
    • None
    • 4.10.z
    • Unknown
    • Moderate
    • No
    • Rejected
    • False
    • None
    • Customer Escalated

      Description of problem:

      kube-apiserver pods causing high CPU utilization after an upgrade

      Version-Release number of selected component (if applicable):

      4.10.55

      How reproducible:

      Frequently

      Steps to Reproduce:

      1. Upgrade an OCP cluster from 4.10.41 to 4.10.55 (an example upgrade command is sketched below these steps).
      2. The upgrade completes successfully.
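      As a rough sketch, assuming the cluster's update channel offers 4.10.55 and the usual pre-upgrade checks have passed, the upgrade in step 1 can be triggered from the CLI roughly like this:

          # Check the current version and the updates offered on the configured channel
          $ oc get clusterversion
          $ oc adm upgrade

          # Request the move to the target z-stream release
          $ oc adm upgrade --to=4.10.55

          # Follow progress until the cluster reports the new version
          $ watch oc get clusterversion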

      Actual results:

      kube-apiserver pods cause high CPU utilization on the node. The kube-apiserver pods were also restarting, causing instability in the cluster. In addition, the cluster was seeing egressIP flapping issues.

      Expected results:

      The upgrade should have completed smoothly, without high CPU utilization or kube-apiserver pod restarts.

      Additional info:

      0. The customer performed the same upgrade on other clusters in the environment but did not face any issues there. (They were advised to upgrade to OCP 4.10.55 because they were seeing egressIP issues on their OCP 4.10.41 clusters.)
      1. The customer has three master nodes (5l, 6l, and 7l) in this cluster.
      2. After the upgrade, when they noticed high CPU utilization on 6l (causing it to restart and making the cluster unstable), they migrated the VM to another ESX host, and after that it was stable. 6l is currently the etcd leader.
      3. Soon after, they started seeing the same issue on 7l (including the API server pod restarting).
      4. We noticed authentication failure messages in kube-rbac-proxy as well as in the API server pods (on both 7l and 5l). Restarting resolved that issue in the DNS pod, but the API server pods still show those errors; they appear to be unrelated to this issue. (A sketch of the log checks is included after this list.)
      5. The etcd pods have had some restarts but are stable, even while the kube API server pods are consuming more CPU. (Example etcd checks are sketched after this list.)
      6. We restarted node 7l, and after the restart it seems stable, but "oc adm top node/pod" still shows the node spiking to 100% every now and then (varying from 9% to 100%). At the same time we see the kube-apiserver pod jumping from a few hundred millicores to 2 cores. Surprisingly, the Prometheus dashboard shows that the majority of API requests are being handled by 5l and 6l. (Example commands and a query are sketched after this list.)
      7. The customer is concerned that the API server pod on 7l may start restarting again and cause instability when their business opens Monday (Apr 24th) AM APAC time, so they want to understand why the CPU utilization spikes intermittently.
      8. A sosreport for node 7l and pprof data for the kube-apiserver container running on 7l have been uploaded to the support case. (A sketch of how such data can be collected is included after this list.)
      9. The sosreport does not have much information to explain the high CPU utilization on the node / kube API server pod.
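      For item 4, a minimal sketch of how the authentication-failure messages can be checked; pod names are placeholders and the grep pattern is only an example, not the exact error string from the customer logs:

          # kube-apiserver logs on a given master, filtered for authentication errors
          $ oc logs -n openshift-kube-apiserver kube-apiserver-<master-node> -c kube-apiserver | grep -i authentication

          # kube-rbac-proxy sidecar in a DNS pod (assumes the default dns-default pods in openshift-dns)
          $ oc logs -n openshift-dns <dns-default-pod> -c kube-rbac-proxy | grep -i authentication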
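      For items 2 and 5, a sketch of how etcd restarts and the current leader can be confirmed; it assumes the standard etcdctl container in the static etcd pods, and the pod name is a placeholder:

          # Restart counts for the etcd pods
          $ oc get pods -n openshift-etcd

          # Endpoint status, including the IS LEADER column
          $ oc exec -n openshift-etcd etcd-<master-node> -c etcdctl -- etcdctl endpoint status --cluster -w table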
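      For item 6, a sketch of the kind of commands and query that show the spikes; the PromQL is an example, not necessarily the exact dashboard query that was used:

          # Node- and pod-level CPU as reported by the metrics API
          $ oc adm top nodes
          $ oc adm top pods -n openshift-kube-apiserver

          # Example PromQL: API request rate per apiserver instance
          sum(rate(apiserver_request_total[5m])) by (instance)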
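      For item 8, a sketch of one way such data can be gathered; it assumes apiserver profiling is enabled (the default) and that go tool pprof is available locally. Note that oc get --raw is served by whichever apiserver the load balancer picks, so it may need repeating to land on 7l:

          # 30-second CPU profile from the serving kube-apiserver
          $ oc get --raw '/debug/pprof/profile?seconds=30' > kube-apiserver-cpu.pprof

          # Show the hottest functions in the profile
          $ go tool pprof -top kube-apiserver-cpu.pprof

          # sosreport from the node (debug pod -> chroot /host -> toolbox -> sos report)
          $ oc debug node/<node-name>
          # inside the debug shell:
          #   chroot /host
          #   toolbox
          #   sos report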

       

            rh-ee-bleanhar Brenton Leanhardt
            anand.paladugu Anand Paladugu
            Ke Wang
            Votes: 0
            Watchers: 7
