-
Sub-task
-
Resolution: Unresolved
-
Impediment
We worked on a customer escalation: https://issues.redhat.com/browse/OCPBUGS-2514
Root Cause:
API requests were timing out with the following error:
Internal Error occurred: resource quota evaluation timed out.
The cluster had many ClusterResourceQuota objects, and some of them each selected more than 100 projects.
The official OpenShift documentation warns about this:
Selecting more than 100 projects under a single multi-project quota may have detrimental effects on API server responsiveness in those projects
https://docs.openshift.com/container-platform/3.11/admin_guide/multiproject_quota.html
The evaluation timeout is happening here:
- https://github.com/openshift/apiserver-library-go/blob/925452e8316c91f03b0a035da38beb4b29cb0664/pkg/admission/quota/clusterresourcequota/admission.go#L114-L120
- OpenShift reuses the upstream evaluator: https://github.com/kubernetes/apiserver/blob/5fc71d89e53bb66e1d90c58026ec9dc519596b55/pkg/admission/plugin/resourcequota/controller.go#L624
Action Items:
- Monitor the evaluation latency (we should use the existing admission plugin metrics) and raise an alert (maybe at Warning severity) to warn the customer beforehand
- The number of projects selected by a ClusterResourceQuota is tracked inside the Status of the object, so we can collect data on how many clusters may be affected by this issue via telemetry/insights
- There might be a few areas in the code that we could optimize; it looks like the evaluation is done asynchronously by a pool of goroutines.
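As a rough sketch of the telemetry/insights idea above, the check could be as simple as counting the namespaces recorded in each quota's status and flagging the ones over the documented limit. `quotaStatus` is a hypothetical, trimmed-down stand-in for the real ClusterResourceQuota status, and `overThreshold` and `projectThreshold` are names made up for this example.

```go
package main

import (
	"fmt"
	"sort"
)

// quotaStatus is a hypothetical stand-in for a ClusterResourceQuota:
// only the quota name and the namespaces listed in its status are modelled.
type quotaStatus struct {
	Name       string
	Namespaces []string
}

// projectThreshold is the limit mentioned in the OpenShift documentation.
const projectThreshold = 100

// overThreshold returns the sorted names of quotas that select more than
// limit namespaces.
func overThreshold(quotas []quotaStatus, limit int) []string {
	var names []string
	for _, q := range quotas {
		if len(q.Namespaces) > limit {
			names = append(names, q.Name)
		}
	}
	sort.Strings(names)
	return names
}

func main() {
	// Sample data for illustration only.
	quotas := []quotaStatus{
		{Name: "team-a", Namespaces: make([]string, 150)},
		{Name: "team-b", Namespaces: make([]string, 12)},
	}
	fmt.Println(overThreshold(quotas, projectThreshold))
}
```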