-
Bug
-
Resolution: Unresolved
-
Major
-
None
-
4.14.z
-
Important
-
None
-
False
-
Description of problem:
etcd encryption is configured as per https://docs.openshift.com/container-platform/4.16/security/encrypting-etcd.html, so data re-encryption takes place about every 7 days. Once in a while, a massive spike in CPU and memory usage can be seen in kube-apiserver while re-encryption is happening. This is troublesome because a large amount of money is spent to accommodate this condition and prevent issues from happening, while only very little load is observed during regular operation.
On further investigation, the problem may be related to https://github.com/kubernetes/enhancements/blob/master/keps/sig-api-machinery/3157-watch-list/README.md and therefore https://issues.redhat.com/browse/API-1378. It also seems that https://github.com/kubernetes/enhancements/tree/master/keps/sig-api-machinery/2340-Consistent-reads-from-cache and subsequently https://github.com/kubernetes/enhancements/tree/master/keps/sig-api-machinery/4192-svm-in-tree#garbage-collection-cache may help to address the problem in the long run.
In the short term, it may be possible to limit the impact by reducing the list page size to a lower value. This change would need to be proposed upstream before it can be applied within OpenShift Container Platform 4. This bug tracks the effort around reducing the list page size to work around the problem until the above-mentioned improvements become available in OpenShift Container Platform 4.
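To illustrate why a smaller list page size could bound the spike, here is a minimal sketch of limit/continue style pagination over an in-memory stand-in. It does not model kube-apiserver or the migration controller; `list_page`, `iterate_pages`, and `peak_objects_in_flight` are hypothetical helpers, and the page sizes 500 and 100 are illustrative values only, not the values actually in use. The only point shown is that the number of objects held per request is capped by the page size.

```python
from typing import Iterator, List, Optional, Tuple


def list_page(
    store: List[str], limit: int, continue_token: Optional[int] = None
) -> Tuple[List[str], Optional[int]]:
    """Return up to `limit` items starting at the token, plus the next token.

    The next token is None once the end of the store has been reached.
    """
    if limit <= 0:
        raise ValueError("limit must be positive")
    start = continue_token or 0
    page = store[start:start + limit]
    next_start = start + len(page)
    next_token = next_start if next_start < len(store) else None
    return page, next_token


def iterate_pages(store: List[str], limit: int) -> Iterator[List[str]]:
    """Yield successive pages until the store is exhausted."""
    token: Optional[int] = None
    while True:
        page, token = list_page(store, limit, token)
        yield page
        if token is None:
            break


def peak_objects_in_flight(store: List[str], limit: int) -> int:
    """Largest number of objects held at once when processing page by page."""
    return max(len(page) for page in iterate_pages(store, limit))
```

In this toy model, lowering `limit` lowers the peak number of objects in flight at the cost of more round trips, which is the trade-off the proposed workaround relies on.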
Version-Release number of selected component (if applicable):
Red Hat OpenShift Container Platform 4 (seen across different versions)
How reproducible:
Random
Steps to Reproduce:
1. Enable etcd data encryption as per https://docs.openshift.com/container-platform/4.16/security/encrypting-etcd.html
2. Fill etcd with a massive number of Secrets and ConfigMaps
3. Wait for re-encryption to happen and watch resource usage during the re-encryption, specifically for kube-apiserver
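For step 2, one possible way to produce a large number of Secrets is to generate the manifests programmatically. The sketch below is an assumption, not the reporter's actual method: the namespace `reencrypt-load-test`, the `load-secret-` name prefix, the 1 KiB payload size, and the helper names are all hypothetical.

```python
import base64
import json
import os
from typing import Any, Dict


def make_secret(
    index: int,
    namespace: str = "reencrypt-load-test",  # hypothetical namespace
    payload_bytes: int = 1024,               # hypothetical payload size
) -> Dict[str, Any]:
    """Build one Opaque Secret manifest carrying a random payload."""
    payload = base64.b64encode(os.urandom(payload_bytes)).decode("ascii")
    return {
        "apiVersion": "v1",
        "kind": "Secret",
        "metadata": {"name": f"load-secret-{index:06d}", "namespace": namespace},
        "type": "Opaque",
        "data": {"payload": payload},
    }


def make_secret_list(count: int, **kwargs: Any) -> Dict[str, Any]:
    """Wrap `count` Secret manifests in a single v1 List object."""
    return {
        "apiVersion": "v1",
        "kind": "List",
        "items": [make_secret(i, **kwargs) for i in range(count)],
    }


def dump_secret_list(count: int, **kwargs: Any) -> str:
    """Serialize the List as JSON, ready to be written to a file."""
    return json.dumps(make_secret_list(count, **kwargs))
```

The JSON returned by `dump_secret_list` could be written to a file and applied with `oc apply -f <file>`, assuming the target namespace already exists. ConfigMaps could be generated the same way with `kind: ConfigMap` and unencoded `data` values.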
Actual results:
Massive spike in CPU and memory usage, causing resource limits to be hit. Hitting resource limits causes components to restart and therefore impacts the overall availability of the OpenShift Container Platform 4 control plane.
Expected results:
etcd data re-encryption should not trigger additional load or resource spikes; it should go unnoticed by the end user and not bring down the OpenShift Container Platform 4 control plane. Further, it is not feasible to size the OpenShift Container Platform 4 control plane node(s) for those spikes, as that would be a waste of money and resources.
Additional info:
- is related to: OCPSTRAT-1344 [API] Support soft-rotation of ETCD datastore encryption (Backlog)