-
Feature Request
-
Resolution: Unresolved
-
Normal
-
None
-
openshift-4.10.z, openshift-4.11.z, openshift-4.12.z
Description of problem:
I observed a broken operator on a customer's OCP cluster that created more than 95,000 Secrets. The sheer number of objects overwhelmed etcd, and the control plane became completely unresponsive. As a workaround, the master nodes were enlarged to restore some functionality, the rogue operator and its namespace were identified, and a ResourceQuota with `.spec.hard.secrets` was put in place to stop the bleeding while the operator's author was consulted.
We should consider shipping default ResourceQuotas (similar to default ulimits in RHEL) to protect our customers from themselves, broken software, and bad actors. Administrators with the right permissions could then deliberately raise the quotas when they genuinely need a large number of objects.
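For illustration, a stopgap quota like the one described above could be generated as follows. This is a minimal sketch, not the manifest that was actually applied on the customer's cluster: the function name `build_object_count_quota`, the namespace `rogue-operator-ns`, the quota name, and the limit of 500 are all assumptions chosen for the example.

```python
import json


def build_object_count_quota(namespace, max_secrets, max_configmaps,
                             name="object-counts"):
    """Return a ResourceQuota manifest (as a dict) capping object counts.

    The helper name, default quota name, and limits are hypothetical;
    pick values that fit the namespace being protected.
    """
    if max_secrets < 0 or max_configmaps < 0:
        raise ValueError("quota limits must be non-negative")
    return {
        "apiVersion": "v1",
        "kind": "ResourceQuota",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "hard": {
                # Object-count quotas: maximum number of each kind
                # allowed to exist in the namespace.
                "secrets": str(max_secrets),
                "configmaps": str(max_configmaps),
            }
        },
    }


if __name__ == "__main__":
    # "rogue-operator-ns" is a placeholder namespace for this sketch.
    quota = build_object_count_quota("rogue-operator-ns", 500, 500)
    print(json.dumps(quota, indent=2))
```

The JSON output can be piped to `oc apply -f -`. Once the quota is in place, creation requests that would exceed a limit are rejected by the API server; objects that already exist are not removed.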
Version-Release number of selected component (if applicable):
Affects all OCP clusters, because the underlying limitation is in etcd
How reproducible:
Trivially reproducible
Steps to Reproduce:
1. Create lots of objects (Secrets, ConfigMaps, anything...) with some kind of looping mechanism
2. Keep doing this until you have tens of thousands of objects
3. Watch etcd and the control plane grind to a halt
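The looping mechanism in the steps above could be as simple as the sketch below, which only generates Secret manifests and prints them as a single `List` object. The helper name `build_secret_batch`, the `flood-test` name prefix, and the namespace are assumptions for illustration; the rogue operator's actual behavior is not known in this detail. It is meant for a disposable test cluster only.

```python
import json


def build_secret_batch(namespace, count, prefix="flood-test"):
    """Return a v1 List containing `count` small Opaque Secrets.

    Names are zero-padded so they sort predictably and are easy to
    select for cleanup afterwards. All names here are hypothetical.
    """
    items = []
    for i in range(count):
        items.append({
            "apiVersion": "v1",
            "kind": "Secret",
            "metadata": {
                "name": f"{prefix}-{i:06d}",
                "namespace": namespace,
            },
            "type": "Opaque",
            # stringData lets the API server handle base64 encoding.
            "stringData": {"payload": "x"},
        })
    return {"apiVersion": "v1", "kind": "List", "items": items}


if __name__ == "__main__":
    # A deliberately small batch; repeat with larger counts to
    # approach the tens of thousands of objects described above.
    print(json.dumps(build_secret_batch("quota-test", 100)))
```

Piping the output to `oc apply -f -` creates the batch. With an object-count ResourceQuota in the target namespace, creation should start failing once the limit is reached, which makes this a quick way to check that a quota is effective.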
Actual results:
The cluster struggles as expected
Expected results:
The cluster struggles as expected
Additional info:
This is really just an attempt to get us thinking about how we can mitigate this issue. I figure we can institute default ResourceQuotas for new Projects, promote the benefits of ResourceQuotas to our customers through documentation or conversations, or do something better that someone else will think of :)
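One way the "default ResourceQuotas for new Projects" option can already be approximated today is by adding a ResourceQuota to the cluster's project request template. The sketch below assumes the template has been exported as JSON and loaded into a dict; the helper name `add_default_quota`, the quota name, and the limits are hypothetical, and the stand-in template in the example is much smaller than a real one.

```python
import copy
import json


def add_default_quota(template, hard_limits,
                      quota_name="default-object-counts"):
    """Return a copy of a project request Template with a quota appended.

    `template` is a dict form of a template.openshift.io/v1 Template.
    The input is left unmodified.
    """
    updated = copy.deepcopy(template)
    quota = {
        "apiVersion": "v1",
        "kind": "ResourceQuota",
        "metadata": {
            "name": quota_name,
            # Template parameter, substituted with the new project's name.
            "namespace": "${PROJECT_NAME}",
        },
        "spec": {"hard": {k: str(v) for k, v in hard_limits.items()}},
    }
    updated.setdefault("objects", []).append(quota)
    return updated


if __name__ == "__main__":
    # Stand-in for a real exported template; a real one also carries
    # a RoleBinding and additional parameters.
    base_template = {
        "apiVersion": "template.openshift.io/v1",
        "kind": "Template",
        "metadata": {"name": "project-request"},
        "objects": [{
            "apiVersion": "project.openshift.io/v1",
            "kind": "Project",
            "metadata": {"name": "${PROJECT_NAME}"},
        }],
        "parameters": [{"name": "PROJECT_NAME"}],
    }
    limits = {"secrets": 1000, "configmaps": 1000}
    print(json.dumps(add_default_quota(base_template, limits), indent=2))
```

A starting template can be produced with `oc adm create-bootstrap-project-template`, and the modified template takes effect once it is created in the `openshift-config` namespace and referenced from `spec.projectRequestTemplate.name` on the `project.config.openshift.io/cluster` resource. This only covers projects created through the project request flow, so it is a partial answer to the request here rather than a shipped default.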