Type: Bug
Resolution: Unresolved
Priority: Major
Fix Version/s: None
Affects Version/s: 4.14.z
Component/s: kube-apiserver
Labels:

Severity:
Important
Regression:
None
Blocked:
False
Blocked Reason:
None

SFDC Cases Counter:
SFDC Cases Open:
SFDC Cases Links:

PX Impact Score:
PX Review Complete:

Description of problem:

etcd encryption is configured as described in https://docs.openshift.com/container-platform/4.16/security/encrypting-etcd.html, so data re-encryption takes place about every 7 days. Once in a while we see a massive spike in CPU and memory usage within kube-apiserver while re-encryption is happening. This is troublesome because a large amount of money is spent on sizing for this condition to prevent issues, while during regular operation only very little load is observed.

On further checking, this appears to be related to https://github.com/kubernetes/enhancements/blob/master/keps/sig-api-machinery/3157-watch-list/README.md and therefore to https://issues.redhat.com/browse/API-1378. It also seems that https://github.com/kubernetes/enhancements/tree/master/keps/sig-api-machinery/2340-Consistent-reads-from-cache and subsequently https://github.com/kubernetes/enhancements/tree/master/keps/sig-api-machinery/4192-svm-in-tree#garbage-collection-cache may help to address the problem in the long run.

In the short term, it may be possible to limit the impact by reducing the list page size. This change would need to be proposed upstream before it can be applied in OpenShift Container Platform 4.
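To illustrate why a smaller list page size could limit the impact, here is a minimal sketch of a paginated list loop. It is a pure in-memory simulation, not the kube-apiserver or migrator implementation: `list_page`, `rewrite_all`, the integer continue token and the page sizes 500 and 50 are all hypothetical and chosen only for illustration. It shows that the amount of data held at once scales with the page size, while the total number of objects processed stays the same.

```python
from typing import List, Optional, Tuple


def list_page(
    store: List[bytes], limit: int, continue_token: Optional[int] = None
) -> Tuple[List[bytes], Optional[int]]:
    """Return one page of at most `limit` objects plus a token for the next page.

    Hypothetical stand-in for a chunked list call (limit/continue semantics).
    """
    start = continue_token or 0
    page = store[start:start + limit]
    # A token is only returned while more objects remain after this page.
    next_token = start + limit if start + limit < len(store) else None
    return page, next_token


def rewrite_all(store: List[bytes], page_size: int) -> Tuple[int, int]:
    """Walk the store page by page.

    Returns (objects processed, peak number of bytes held in one page).
    """
    processed = 0
    peak_bytes = 0
    token: Optional[int] = None
    while True:
        page, token = list_page(store, page_size, token)
        peak_bytes = max(peak_bytes, sum(len(obj) for obj in page))
        processed += len(page)  # stand-in for re-writing each object
        if token is None:
            break
    return processed, peak_bytes


if __name__ == "__main__":
    # 10,000 fake objects of 1 KiB each (assumed sizes, for illustration only).
    fake_store = [b"x" * 1024 for _ in range(10_000)]
    for size in (500, 50):
        count, peak = rewrite_all(fake_store, size)
        print(f"page_size={size}: processed={count}, peak_bytes_per_page={peak}")
```

The trade-off is more round trips: a page size ten times smaller means roughly ten times as many list requests, so the spike is traded for a longer, flatter run.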

This bug tracks the effort to reduce the list page size as a workaround until the above-mentioned improvements become available in OpenShift Container Platform 4.

Version-Release number of selected component (if applicable):

Red Hat OpenShift Container Platform 4 (seen across different versions)

How reproducible:

Random

Steps to Reproduce:

1. Enable etcd data encryption as per https://docs.openshift.com/container-platform/4.16/security/encrypting-etcd.html
2. Fill etcd with a massive number of Secret and ConfigMap objects
3. Wait for re-encryption to happen and watch resource usage during the re-encryption, specifically for kube-apiserver
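Step 2 can be approximated with a small generator script. The sketch below is an assumption about how one might fill etcd, not part of the original report: the namespace `reencrypt-test`, the `filler-` name prefix, the object count and the payload size are all hypothetical. It builds a `v1` `List` of `Opaque` Secrets with random base64-encoded payloads and prints it as JSON.

```python
import base64
import json
import os
from typing import Any, Dict


def make_secret(name: str, namespace: str, payload: bytes) -> Dict[str, Any]:
    """Build a v1 Secret manifest; Secret `data` values must be base64-encoded."""
    return {
        "apiVersion": "v1",
        "kind": "Secret",
        "metadata": {"name": name, "namespace": namespace},
        "type": "Opaque",
        "data": {"payload": base64.b64encode(payload).decode("ascii")},
    }


def make_secret_list(
    count: int, namespace: str = "reencrypt-test", size: int = 1024
) -> Dict[str, Any]:
    """Build a v1 List containing `count` Secrets with `size` random bytes each.

    The namespace, name prefix and sizes are hypothetical test values.
    """
    items = [
        make_secret(f"filler-{i:06d}", namespace, os.urandom(size))
        for i in range(count)
    ]
    return {"apiVersion": "v1", "kind": "List", "items": items}


if __name__ == "__main__":
    # Small default so the output stays manageable; raise `count` for a real test.
    print(json.dumps(make_secret_list(100)))
```

Assuming the script is saved under a name of your choice and the target namespace already exists, its output could be piped into `oc apply -f -`. Run it only against a test cluster, since the objects are throwaway data.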

Actual results:

Massive spike in CPU and memory usage, causing resource limits to be hit. Hitting resource limits causes components to restart and therefore impacts the overall availability of the OpenShift Container Platform 4 control plane.

Expected results:

etcd data re-encryption should not trigger additional load or a resource spike; it should go unnoticed by the end user and not bring down the OpenShift Container Platform 4 control plane. Further, it is not feasible to size the OpenShift Container Platform 4 control plane node(s) for those spikes, as that would be a waste of money and resources.

Additional info:

is related to

OCPSTRAT-1344 [API] Support soft-rotation of ETCD datastore encryption

Backlog

Assignee: Unassigned

Reporter: Simon Reber

QA Contact: Ke Wang

Votes: 0

Watchers: 3

Created: 2024/08/15 11:38 AM

Updated: 2025/02/20 7:48 PM
