OCPBUGS-30967

OLM caused a kube-apiserver outage, and consequently an etcd outage, by flooding the API server with requests.

    • Bug
    • Resolution: Cannot Reproduce
    • Normal
    • None
    • 4.13
    • OLM
    • Important
    • No
    • Rejected
    • False

      Description of problem:

      OLM caused a kube-apiserver outage, and consequently an etcd outage, by flooding the API server with requests.

      Version-Release number of selected component (if applicable):

      4.13.34

      Actual results (using audit.log.tar.gz from 3/12/2024 2:27 PM):

      $ zcat audit.log.tar.gz  | jq -Rr 'fromjson? | select(.requestReceivedTimestamp | contains("2024-03-11")) | (.responseStatus.code|tostring) + " " +  " [" + .user.username + "] " + (.verb|ascii_upcase) + " " + .requestURI'  | sort | uniq -c | sort -n  | tail -5 
      36827 200  [system:serviceaccount:openshift-operator-lifecycle-manager:olm-operator-serviceaccount] LIST /apis/admissionregistration.k8s.io/v1/mutatingwebhookconfigurations?labelSelector=olm.owner%3Dcryostat-operator.v2.4.0-3%2Colm.owner.kind%3DClusterServiceVersion%2Colm.owner.namespace%3Dcryostat
      36829 200  [system:serviceaccount:openshift-operator-lifecycle-manager:olm-operator-serviceaccount] LIST /apis/admissionregistration.k8s.io/v1/validatingwebhookconfigurations?labelSelector=olm.owner%3Dcryostat-operator.v2.4.0-3%2Colm.owner.kind%3DClusterServiceVersion%2Colm.owner.namespace%3Dcryostat
      36884 200  [system:serviceaccount:openshift-operator-lifecycle-manager:olm-operator-serviceaccount] LIST /apis/operators.coreos.com/v1/namespaces/cryostat/operatorgroups
      37052 404  [system:serviceaccount:openshift-operator-lifecycle-manager:olm-operator-serviceaccount] GET /apis/operators.coreos.com/v1alpha1/namespaces/cryostat/clusterserviceversions/cryostat-operator.v2.4.0
      39703 201  [system:serviceaccount:openshift-apiserver:openshift-apiserver-sa] CREATE /apis/authorization.k8s.io/v1/subjectaccessreviews?timeout=10s
      
      $ zcat audit.log.tar.gz  | jq -Rr 'fromjson? | select(.requestReceivedTimestamp | contains("2024-03-11") ) | select(.requestURI=="/apis/operators.coreos.com/v1alpha1/namespaces/cryostat/clusterserviceversions/cryostat-operator.v2.4.0")' | wc -l                                                   
       37052
      
      // First and last minute in which the OLM service account issued a request (used to get the time range)
      
      $ zcat audit.log.tar.gz  | jq -Rr 'fromjson? | select(.requestReceivedTimestamp | contains("2024-03-11") ) | select(.user.username=="system:serviceaccount:openshift-operator-lifecycle-manager:olm-operator-serviceaccount") | .requestReceivedTimestamp' -r | egrep -o "2024-03-11T.*:.*.:" | uniq | sed -e 1b -e '$!d'
      2024-03-11T11:39:
      2024-03-11T16:36:

      From this output, OLM issued about 2 requests per second (37052 requests between 11:39 and 16:36, roughly 17,820 seconds) solely for the nonexistent resource /apis/operators.coreos.com/v1alpha1/namespaces/cryostat/clusterserviceversions/cryostat-operator.v2.4.0, which returns 404, and about 10 requests per second when all cryostat-related requests are counted.
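
      The per-second figures come from dividing a request count by the length of the observed window. A minimal sketch of that arithmetic, using the counts and the minute-resolution timestamps printed above; the helper name `requests_per_second` is hypothetical, and the window is only accurate to the minute:

```python
from datetime import datetime

def requests_per_second(count: int, first: str, last: str) -> float:
    """Average request rate over the window [first, last].

    Timestamps use the minute-resolution form printed by the
    pipeline above, e.g. "2024-03-11T11:39:".
    """
    fmt = "%Y-%m-%dT%H:%M:"
    window = datetime.strptime(last, fmt) - datetime.strptime(first, fmt)
    return count / window.total_seconds()

# Counts copied from the audit log output above.
not_found_gets = 37052
# The four olm-operator-serviceaccount lines in the `tail -5` output.
top_olm_requests = 36827 + 36829 + 36884 + 37052

first, last = "2024-03-11T11:39:", "2024-03-11T16:36:"
print(f"404 GETs only:         {requests_per_second(not_found_gets, first, last):.2f} req/s")
print(f"four top OLM requests: {requests_per_second(top_olm_requests, first, last):.2f} req/s")
```

      The four listed request types alone come to a little over 8 requests per second; the ~10 figure presumably also includes cryostat-related requests that fall outside the `tail -5` output.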

      Expected results:

      OLM should poll these resources through the API at more relaxed intervals to avoid bringing down kube-apiserver.
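
      One way to read "more relaxed intervals" is capped exponential backoff: each consecutive failed lookup (such as the repeated 404) doubles the wait before the next attempt, up to a ceiling. The sketch below only illustrates the idea. It is not OLM's actual code, and `backoff_delays`, `attempts_in_window`, the 1-second base delay, and the 300-second cap are all hypothetical choices made for illustration.

```python
from typing import Iterator

def backoff_delays(base: float = 1.0, cap: float = 300.0) -> Iterator[float]:
    """Yield wait times in seconds: base, 2*base, 4*base, ... capped at `cap`."""
    delay = base
    while True:
        yield min(delay, cap)
        delay = min(delay * 2, cap)  # keep the value bounded at the cap

def attempts_in_window(window_seconds: float, base: float = 1.0, cap: float = 300.0) -> int:
    """Count how many retries fit in a window when every attempt fails."""
    elapsed, attempts = 0.0, 0
    for delay in backoff_delays(base, cap):
        if elapsed + delay > window_seconds:
            break
        elapsed += delay
        attempts += 1
    return attempts

# Window observed in the audit log: 11:39 to 16:36, about 17820 seconds.
print(attempts_in_window(17820))
```

      Under these assumed parameters, a lookup that keeps failing would be retried fewer than a hundred times over the same five-hour window, compared with the 37052 GET requests recorded in the audit log.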

              krizza@redhat.com Kevin Rizza
              rhn-support-gmeghnag Gabriel Meghnagi
              Jian Zhang Jian Zhang