  1. OpenShift Bugs
  2. OCPBUGS-44378

ClusterResourceOverride Operator creating high quantity of clusterresourceoverride-token secrets (57k) in the OCP cluster


    • Important
    • No
    • 0
    • False
      * Previously, when the Cluster Resource Override Operator was unable to completely deploy its operand controller, the Operator restarted the deployment process. On each attempt, the Operator created a new set of secrets. This resulted in a large number of secrets in the namespace where the Cluster Resource Override Operator was deployed. With this release, the Operator correctly processes the service account annotations and creates only one set of secrets. (link:https://issues.redhat.com/browse/OCPBUGS-44378[*OCPBUGS-44378*])
    • Bug Fix
    • Done
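The release note above describes a fix after which repeated deployment attempts leave only one set of secrets. As a rough illustration of that create-or-reuse idea, the following toy model keeps everything in memory. It is not the Operator's code: the annotation key `example.invalid/token-secret-name`, the `ServiceAccount` and `Namespace` classes, and the naming scheme are all hypothetical, since the report does not say which annotations the fix inspects.

```python
from dataclasses import dataclass, field
from typing import Dict, List

# Hypothetical annotation key, used only for this illustration.
TOKEN_ANNOTATION = "example.invalid/token-secret-name"


@dataclass
class ServiceAccount:
    name: str
    annotations: Dict[str, str] = field(default_factory=dict)


@dataclass
class Namespace:
    # Names of the secrets that currently exist in the namespace.
    secrets: List[str] = field(default_factory=list)


def ensure_token_secret(ns: Namespace, sa: ServiceAccount) -> str:
    """Return the token secret for `sa`, creating one only if none is recorded."""
    existing = sa.annotations.get(TOKEN_ANNOTATION)
    if existing is not None and existing in ns.secrets:
        # A previous attempt already created the secret: reuse it.
        return existing
    # No usable secret recorded: create one and remember it on the service account.
    name = f"{sa.name}-token-{len(ns.secrets)}"
    ns.secrets.append(name)
    sa.annotations[TOKEN_ANNOTATION] = name
    return name


if __name__ == "__main__":
    ns = Namespace()
    sa = ServiceAccount(name="clusterresourceoverride")
    # Simulate the Operator retrying the deployment several times.
    for _ in range(5):
        ensure_token_secret(ns, sa)
    print(len(ns.secrets), "secret(s) after 5 attempts")
```

The point of the sketch is only that each retry consults recorded state before creating anything, so the number of secrets does not grow with the number of attempts.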

      This is a clone of issue OCPBUGS-44351. The following is the description of the original issue:

      This is a clone of issue OCPBUGS-42718. The following is the description of the original issue:

      This is a clone of issue OCPBUGS-15462. The following is the description of the original issue:

      Description of problem:

      The ClusterResourceOverride Operator is constantly creating a large number of secrets, which are stored in the cluster's etcd and drive up the cluster's resource utilization. Because of the memory peak caused by these requests, about 60 containers on the master node changed status to exited because of the OOM killer.
      
      After the upgrade from 4.10 to 4.12, we noticed that the etcd objects in the cluster were growing quickly. They increased from 500 MB to 2.5 GB in 2 days.
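To put the growth figures above in perspective, the average rate can be worked out from the two sizes and the elapsed time. This is a small sketch that assumes decimal units (1 GB = 1000 MB) and steady growth over the 2 days, neither of which the report states.

```python
# Figures quoted in the report; decimal units (1 GB = 1000 MB) are an assumption.
START_MB = 500
END_MB = 2.5 * 1000
ELAPSED_DAYS = 2


def growth_rate_mb_per_hour(start_mb: float, end_mb: float, days: float) -> float:
    """Average growth in MB per hour over the given number of days."""
    return (end_mb - start_mb) / (days * 24)


if __name__ == "__main__":
    rate = growth_rate_mb_per_hour(START_MB, END_MB, ELAPSED_DAYS)
    print(f"average growth: {rate:.1f} MB/hour")
```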
      
      When we checked the etcd objects, we found that the ClusterResourceOverride Operator was creating a huge number of secrets, with no identified cause:
      
      
      Old report of objects in the cluster: 
      
      [NUMBER OF OBJECTS IN ETCD] 
      273139 events
        38353 secrets <<<<<<<
      
      When we checked the secrets with the oc command line, it returned about 1.5k secrets.
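One way to see how many of the listed secrets belong to the Operator is to count names by prefix. The sketch below assumes the input has the shape of a Kubernetes list in JSON (an `items` array whose entries carry `metadata.name`), as produced by `oc get secrets -n openshift-clusterresourceoverride -o json`. The namespace and the `clusterresourceoverride-token-` prefix come from the log lines in this report; the sample data is made up for illustration.

```python
import json
from collections import Counter

# Prefix taken from the secret names that appear in the kube-apiserver logs.
TOKEN_PREFIX = "clusterresourceoverride-token-"


def count_token_secrets(secret_list_json: str, prefix: str = TOKEN_PREFIX) -> Counter:
    """Count secrets whose name starts with `prefix` versus all others."""
    doc = json.loads(secret_list_json)
    counts: Counter = Counter()
    for item in doc.get("items", []):
        name = item.get("metadata", {}).get("name", "")
        counts["token" if name.startswith(prefix) else "other"] += 1
    return counts


# Made-up sample in the shape of a Kubernetes Secret list.
SAMPLE = json.dumps({
    "items": [
        {"metadata": {"name": "clusterresourceoverride-token-nmzvt"}},
        {"metadata": {"name": "clusterresourceoverride-token-aaaaa"}},
        {"metadata": {"name": "some-other-secret"}},
    ]
})

if __name__ == "__main__":
    counts = count_token_secrets(SAMPLE)
    print(f"token secrets: {counts['token']}, other secrets: {counts['other']}")
```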
      
      
      On the master node, we can see that the high utilization comes from the API server:
      
      
      Top MEM-using processes: 
          USER      PID    %CPU  %MEM  VSZ-MiB  RSS-MiB  TTY    STAT   START  TIME  COMMAND  
          root      52188  133   38.9  26447    18792    ?      -      10:36  3:05  kube-apiserver --openshift-config=/etc/kubernetes/static-pod-resources/configmaps/config/config.yaml --
      
      
      We can also observe these messages in the kube-apiserver: 
      
      /kube-apiserver/kube-apiserver/logs/previous.log:2023-06-22T07:11:17.036606087Z I0622 07:11:17.036542      18 trace.go:205] Trace[1639443765]: "Get" url:/api/v1/namespaces/openshift-clusterresourceoverride/secrets/clusterresourceoverride-token-xxx,user-agent:kube-controller-manager/v1.25.8+37a9a08 (linux/amd64) kubernetes/df8a1b9/system:serviceaccount:kube-system:generic-garbage-collector,audit-id:aaef554b-06dd-4801-94e8-601b5ce75a10,client:10.127.206.24,accept:application/vnd.kubernetes.protobuf;as=PartialObjectMetadata;g=meta.k8s.io;v=v1,application/json;as=PartialObjectMetadata;g=meta.k8s.io;v=v1,application/json,protocol:HTTP/2.0 (22-Jun-2023 07:10:21.424) (total time: 55611ms):
      
      kube-apiserver/kube-apiserver/logs/previous.log:2023-06-22T07:11:17.037224152Z E0622 07:11:17.036738      18 timeout.go:141] post-timeout activity - time-elapsed: 1.417764462s, GET "/api/v1/namespaces/openshift-clusterresourceoverride/secrets/clusterresourceoverride-token-nmzvt" result: <nil>
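When many such trace lines need to be reviewed, the request verb, URL, and total time can be pulled out with a regular expression. The following is a minimal sketch that assumes every trace line follows the layout of the one quoted above; the sample string is an abbreviated copy of that line with most of the request attributes dropped.

```python
import re
from typing import Optional, Tuple

# Matches the verb, the url field (up to the first comma), and the total time.
TRACE_RE = re.compile(
    r'Trace\[\d+\]: "(?P<verb>\w+)" url:(?P<url>[^,]+),.*'
    r'\(total time: (?P<ms>\d+)ms\)'
)


def parse_trace(line: str) -> Optional[Tuple[str, str, int]]:
    """Return (verb, url, total_ms) for a trace line, or None if it does not match."""
    match = TRACE_RE.search(line)
    if match is None:
        return None
    return match.group("verb"), match.group("url"), int(match.group("ms"))


# Abbreviated copy of the trace line from the report.
SAMPLE = (
    'I0622 07:11:17.036542      18 trace.go:205] Trace[1639443765]: "Get" '
    'url:/api/v1/namespaces/openshift-clusterresourceoverride/secrets/'
    'clusterresourceoverride-token-xxx,'
    'user-agent:kube-controller-manager/v1.25.8+37a9a08,protocol:HTTP/2.0 '
    '(22-Jun-2023 07:10:21.424) (total time: 55611ms):'
)

if __name__ == "__main__":
    parsed = parse_trace(SAMPLE)
    if parsed is not None:
        verb, url, total_ms = parsed
        print(f"{verb} {url} took {total_ms / 1000:.1f}s")
```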
      
      
      
      

       

      Version-Release number of selected component (if applicable):

       

      How reproducible:

       

      Steps to Reproduce:

      1.
      2.
      3.
      

      Actual results:

      A large number of secrets in etcd

      Expected results:

      The Operator stops sending requests that add secrets

      Additional info:

      In this case, the customer disabled the operator.

              joelsmith.redhat Joel Smith
              openshift-crt-jira-prow OpenShift Prow Bot
              Aditi Sahay Aditi Sahay