Root Cause:

API requests were timing out with the following error:

Internal Error occurred: resource quota evaluation timed out.

The cluster had many ClusterResourceQuota objects, and some of them selected more than 100 projects.

The official OpenShift documentation warns about this:

Selecting more than 100 projects under a single multi-project quota may have detrimental effects on API server responsiveness in those projects
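
To check whether a cluster is in this situation, something like the following sketch could help. It flags every ClusterResourceQuota that selects more than 100 projects. It assumes the quota list has been exported as JSON (for example with `oc get clusterresourcequota -o json`) and that each object lists its selected projects under `status.namespaces`; the function name `oversized_quotas` is made up for this illustration.

```python
PROJECT_LIMIT = 100  # threshold quoted from the OpenShift documentation


def oversized_quotas(quota_list, limit=PROJECT_LIMIT):
    """Return {quota name: selected project count} for quotas above the limit."""
    oversized = {}
    for item in quota_list.get("items", []):
        name = item.get("metadata", {}).get("name", "<unnamed>")
        # Assumption: status.namespaces holds one entry per selected project.
        namespaces = item.get("status", {}).get("namespaces") or []
        if len(namespaces) > limit:
            oversized[name] = len(namespaces)
    return oversized


# Small hand-built sample instead of real cluster output.
sample = {
    "items": [
        {
            "metadata": {"name": "team-a"},
            "status": {"namespaces": [{"namespace": f"proj-{i}"} for i in range(150)]},
        },
        {
            "metadata": {"name": "team-b"},
            "status": {"namespaces": [{"namespace": "proj-x"}]},
        },
    ]
}
print(oversized_quotas(sample))
```

On a real cluster, the parsed JSON export would be passed in place of `sample`.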

The evaluation timeout is happening here:

Action Items:

Monitor the evaluation latency (we should use the existing admission plugin metrics) and raise an alert (perhaps at Warning severity) so the customer is warned before requests start failing.
The number of projects selected by a ClusterResourceQuota is tracked in the Status of the object, so we can collect data via telemetry/insights on how many clusters may be affected by this issue.
There might be a few areas in the code that we could optimize. The evaluation appears to be done asynchronously by a pool of goroutines.
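
To make the last item concrete, here is a simplified model of how a pooled, asynchronous evaluation can surface as a timeout to the caller. It is not the actual admission plugin code: it uses a Python thread pool in place of goroutines, and `evaluate_with_deadline`, `slow_evaluate`, and the per-project cost are all assumptions chosen for illustration.

```python
import concurrent.futures
import time


class QuotaEvaluationTimeout(Exception):
    """Raised when an evaluation does not finish within the deadline."""


def evaluate_with_deadline(pool, evaluate, request, timeout_s):
    # Hand the work to the pool, then wait a bounded time for the result.
    future = pool.submit(evaluate, request)
    try:
        return future.result(timeout=timeout_s)
    except concurrent.futures.TimeoutError:
        # The worker keeps running; only the caller gives up.
        raise QuotaEvaluationTimeout("resource quota evaluation timed out") from None


def slow_evaluate(request):
    # Hypothetical stand-in: cost grows with the number of selected projects.
    time.sleep(0.001 * request["selected_projects"])
    return "allowed"


if __name__ == "__main__":
    with concurrent.futures.ThreadPoolExecutor(max_workers=2) as pool:
        for projects in (5, 400):
            try:
                result = evaluate_with_deadline(
                    pool, slow_evaluate, {"selected_projects": projects}, 0.1
                )
            except QuotaEvaluationTimeout as exc:
                result = f"error: {exc}"
            print(projects, result)
```

In this model the caller gives up while the worker is still busy, so slow evaluations also occupy pool capacity that later requests need. If the real evaluator behaves similarly, reducing per-evaluation cost would be the first place to look.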