-
Bug
-
Resolution: Cannot Reproduce
-
Normal
-
None
-
4.13
-
Important
-
No
-
Rejected
-
False
-
Description of problem:
OLM caused a kube-apiserver outage, and consequently an etcd outage, by flooding the API server with requests.
Version-Release number of selected component (if applicable):
4.13.34
Actual results (using audit.log.tar.gz from 3/12/2024 2:27 PM):
$ zcat audit.log.tar.gz | jq -Rr 'fromjson? | select(.requestReceivedTimestamp | contains("2024-03-11")) | (.responseStatus.code|tostring) + " " + " [" + .user.username + "] " + (.verb|ascii_upcase) + " " + .requestURI' | sort | uniq -c | sort -n | tail -5
  36827 200 [system:serviceaccount:openshift-operator-lifecycle-manager:olm-operator-serviceaccount] LIST /apis/admissionregistration.k8s.io/v1/mutatingwebhookconfigurations?labelSelector=olm.owner%3Dcryostat-operator.v2.4.0-3%2Colm.owner.kind%3DClusterServiceVersion%2Colm.owner.namespace%3Dcryostat
  36829 200 [system:serviceaccount:openshift-operator-lifecycle-manager:olm-operator-serviceaccount] LIST /apis/admissionregistration.k8s.io/v1/validatingwebhookconfigurations?labelSelector=olm.owner%3Dcryostat-operator.v2.4.0-3%2Colm.owner.kind%3DClusterServiceVersion%2Colm.owner.namespace%3Dcryostat
  36884 200 [system:serviceaccount:openshift-operator-lifecycle-manager:olm-operator-serviceaccount] LIST /apis/operators.coreos.com/v1/namespaces/cryostat/operatorgroups
  37052 404 [system:serviceaccount:openshift-operator-lifecycle-manager:olm-operator-serviceaccount] GET /apis/operators.coreos.com/v1alpha1/namespaces/cryostat/clusterserviceversions/cryostat-operator.v2.4.0
  39703 201 [system:serviceaccount:openshift-apiserver:openshift-apiserver-sa] CREATE /apis/authorization.k8s.io/v1/subjectaccessreviews?timeout=10s

$ zcat audit.log.tar.gz | jq -Rr 'fromjson? | select(.requestReceivedTimestamp | contains("2024-03-11") ) | select(.requestURI=="/apis/operators.coreos.com/v1alpha1/namespaces/cryostat/clusterserviceversions/cryostat-operator.v2.4.0")' | wc -l
37052

Showing the first and the last time the request was executed (to get the time range):

$ zcat audit.log.tar.gz | jq -Rr 'fromjson? | select(.requestReceivedTimestamp | contains("2024-03-11") ) | select(.user.username=="system:serviceaccount:openshift-operator-lifecycle-manager:olm-operator-serviceaccount") | .requestReceivedTimestamp' -r | egrep -o "2024-03-11T.*:.*.:" | uniq | sed -e 1b -e '$!d'
2024-03-11T11:39:
2024-03-11T16:36:
From this output, OLM issued roughly 2 requests per second against the non-existent endpoint /apis/operators.coreos.com/v1alpha1/namespaces/cryostat/clusterserviceversions/cryostat-operator.v2.4.0 alone (every one of them returned 404), and roughly 10 requests per second when all cryostat-related requests are counted together.
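For reference, the ~2 requests per second figure can be re-derived from the numbers in the output above. This is only a back-of-the-envelope sketch: it takes the 37052 count and the 11:39 to 16:36 window as given, uses minute resolution, and the helper name to_seconds is made up for this example.

```shell
#!/usr/bin/env bash
# Figures copied from the audit-log output in this report.
count=37052     # 404 GETs for .../clusterserviceversions/cryostat-operator.v2.4.0
start="11:39"   # first minute with an OLM request on 2024-03-11
end="16:36"     # last minute with an OLM request on 2024-03-11

# Hypothetical helper: convert HH:MM to seconds since midnight.
# The 10# prefix forces base 10 so values like "08" are not read as octal.
to_seconds() {
  local h=${1%%:*} m=${1##*:}
  echo $(( 10#$h * 3600 + 10#$m * 60 ))
}

window=$(( $(to_seconds "$end") - $(to_seconds "$start") ))

# Shell arithmetic is integer-only, so awk does the division.
rate=$(LC_ALL=C awk -v c="$count" -v w="$window" 'BEGIN { printf "%.2f", c / w }')

echo "window=${window}s rate=${rate} req/s"
```

The same calculation can be repeated with the summed count of all cryostat-related lines to check the ~10 requests per second estimate.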
Expected results:
OLM should poll these resources via the API at more relaxed intervals to avoid overwhelming kube-apiserver.
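To give a sense of what "more relaxed intervals" could mean, the sketch below counts how many requests a capped exponential backoff would issue over the same window. This is purely illustrative and is not OLM's actual retry logic: the 1 second starting delay, the 300 second cap, and the doubling policy are all assumptions chosen for the example.

```shell
#!/usr/bin/env bash
# Illustrative only: NOT how OLM schedules its retries.
window=17820   # seconds between 11:39 and 16:36, from the timestamps in this report
delay=1        # assumed initial retry delay in seconds
cap=300        # assumed maximum retry delay in seconds

t=0            # simulated time of the next request
requests=0

# Issue a request, wait `delay` seconds, then double the delay up to the cap.
while [ "$t" -le "$window" ]; do
  requests=$(( requests + 1 ))
  t=$(( t + delay ))
  delay=$(( delay * 2 ))
  if [ "$delay" -gt "$cap" ]; then
    delay=$cap
  fi
done

echo "requests with capped backoff: $requests (observed: 37052)"
```

Under these assumed parameters the repeated 404 lookup would be issued a few dozen times over the five-hour window instead of tens of thousands of times.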