OpenShift Bugs / OCPBUGS-42718

ClusterResourceOverride Operator creating high quantity of clusterresourceoverride-token secrets (57k) in the OCP cluster


    • Important
    • No
    • 0
    • False
    • When the Cluster Resource Override Operator could not fully deploy its operand controller, it restarted the deployment process. Each attempt created a new set of secrets, which could leave thousands of secrets in the namespace where the Cluster Resource Override Operator was deployed. The fixed version handles the service account annotations correctly, so only one set of secrets is created.
    • Bug Fix
    • In Progress

      This is a clone of issue OCPBUGS-15462. The following is the description of the original issue:

      Description of problem:

      The ClusterResourceOverride operator constantly creates a large number of secrets in the cluster's etcd, driving up resource utilization on the cluster. Because of the memory peak caused by these requests, ~60 containers on a master node exited after being killed by the OOM killer.
      
      After the upgrade from 4.10 to 4.12, we noticed that the etcd objects in the cluster were growing quickly: from 500MB to 2.5GB in 2 days.
      
      When we checked the etcd objects, we found that the ClusterResourceOverride operator was creating a huge number of secrets with no identified cause:
      
      
      Old report of objects in the cluster: 
      
      [NUMBER OF OBJECTS IN ETCD] 
      273139 events
        38353 secrets <<<<<<<
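A per-resource report like the one above can be approximated by tallying etcd keys. The following is a minimal sketch, not the tool that produced the report: it assumes keys follow the `/kubernetes.io/<resource>/<namespace>/<name>` layout, and the function name, constant, and sample keys are hypothetical.

```python
from collections import Counter

# Assumption: Kubernetes objects are stored in etcd under this key prefix.
ASSUMED_PREFIX = "/kubernetes.io/"


def count_objects_by_resource(keys):
    """Count etcd keys per resource type (first path segment after the prefix).

    Limitation: for resources stored under an API group segment, the group
    name is counted instead of the resource name.
    """
    counts = Counter()
    for key in keys:
        if not key.startswith(ASSUMED_PREFIX):
            continue  # skip keys outside the assumed layout
        resource = key[len(ASSUMED_PREFIX):].split("/", 1)[0]
        if resource:
            counts[resource] += 1
    return counts


# Hypothetical sample keys, for illustration only.
sample_keys = [
    "/kubernetes.io/secrets/openshift-clusterresourceoverride/clusterresourceoverride-token-aaaaa",
    "/kubernetes.io/secrets/openshift-clusterresourceoverride/clusterresourceoverride-token-bbbbb",
    "/kubernetes.io/events/default/example-event",
    "/unrelated/key",
]

report = count_objects_by_resource(sample_keys)
for resource, n in report.most_common():
    print(f"{n:>8} {resource}")
```

The key list itself would have to come from an etcd key dump; how that dump is obtained is outside this sketch.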
      
      When we listed the secrets with the oc command line, it returned only 1.5k secrets.
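To see how many of the listed secrets belong to the operator, the JSON form of the secret list can be filtered by name. This is a minimal sketch: it assumes the input is a Kubernetes `List` document with `items[].metadata.name`, as produced by `oc get secrets -o json`, and it takes the `clusterresourceoverride-token-` prefix from the secret names in this report. The function name and the sample data are hypothetical.

```python
import json

# Name prefix taken from the secret names seen in this report.
TOKEN_PREFIX = "clusterresourceoverride-token-"


def count_token_secrets(secret_list_json, prefix=TOKEN_PREFIX):
    """Return how many secrets in a JSON Secret list have names starting with prefix."""
    doc = json.loads(secret_list_json)
    names = [item["metadata"]["name"] for item in doc.get("items", [])]
    return sum(1 for name in names if name.startswith(prefix))


# Hypothetical sample input, for illustration only.
sample = json.dumps({
    "kind": "List",
    "items": [
        {"metadata": {"name": "clusterresourceoverride-token-aaaaa"}},
        {"metadata": {"name": "clusterresourceoverride-token-bbbbb"}},
        {"metadata": {"name": "some-other-secret"}},
    ],
})

print(count_token_secrets(sample))
```

In practice the input could be the saved output of `oc get secrets -n openshift-clusterresourceoverride -o json`, where the namespace is the one that appears in the kube-apiserver log lines.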
      
      
      On the master node, we can see that the high utilization comes from the API server:
      
      
      Top MEM-using processes: 
          USER      PID    %CPU  %MEM  VSZ-MiB  RSS-MiB  TTY    STAT   START  TIME  COMMAND  
          root      52188  133   38.9  26447    18792    ?      -      10:36  3:05  kube-apiserver --openshift-config=/etc/kubernetes/static-pod-resources/configmaps/config/config.yaml --
      
      
      We can also observe these messages in the kube-apiserver logs:
      
      /kube-apiserver/kube-apiserver/logs/previous.log:2023-06-22T07:11:17.036606087Z I0622 07:11:17.036542      18 trace.go:205] Trace[1639443765]: "Get" url:/api/v1/namespaces/openshift-clusterresourceoverride/secrets/clusterresourceoverride-token-xxx,user-agent:kube-controller-manager/v1.25.8+37a9a08 (linux/amd64) kubernetes/df8a1b9/system:serviceaccount:kube-system:generic-garbage-collector,audit-id:aaef554b-06dd-4801-94e8-601b5ce75a10,client:10.127.206.24,accept:application/vnd.kubernetes.protobuf;as=PartialObjectMetadata;g=meta.k8s.io;v=v1,application/json;as=PartialObjectMetadata;g=meta.k8s.io;v=v1,application/json,protocol:HTTP/2.0 (22-Jun-2023 07:10:21.424) (total time: 55611ms):
      
      kube-apiserver/kube-apiserver/logs/previous.log:2023-06-22T07:11:17.037224152Z E0622 07:11:17.036738      18 timeout.go:141] post-timeout activity - time-elapsed: 1.417764462s, GET "/api/v1/namespaces/openshift-clusterresourceoverride/secrets/clusterresourceoverride-token-nmzvt" result: <nil>
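Slow requests such as the 55611ms `Get` above can be pulled out of a log file with a small parser. The sketch below assumes trace lines keep the shape shown in this report (`Trace[id]: "Verb" url:<path>,... (total time: Nms)`); the regular expression and function name are hypothetical, and the sample line is a shortened copy of the log line above.

```python
import re

# Assumed shape of a kube-apiserver trace line, based on the sample in this report.
TRACE_RE = re.compile(
    r'Trace\[\d+\]: "(?P<verb>[^"]+)" url:(?P<url>[^,]+),.*\(total time: (?P<ms>\d+)ms\)'
)


def parse_trace(line):
    """Extract verb, URL, and total time in ms from a trace line, or None if no match."""
    match = TRACE_RE.search(line)
    if match is None:
        return None
    return {
        "verb": match.group("verb"),
        "url": match.group("url"),
        "total_ms": int(match.group("ms")),
    }


# Shortened copy of the trace line from this report (most fields omitted).
sample_line = (
    'I0622 07:11:17.036542      18 trace.go:205] Trace[1639443765]: "Get" '
    "url:/api/v1/namespaces/openshift-clusterresourceoverride/secrets/"
    "clusterresourceoverride-token-xxx,"
    "user-agent:kube-controller-manager/v1.25.8+37a9a08,protocol:HTTP/2.0 "
    "(22-Jun-2023 07:10:21.424) (total time: 55611ms):"
)

parsed = parse_trace(sample_line)
print(parsed)
```

Filtering the parsed entries by `total_ms` and grouping by `url` would show whether the slow requests concentrate on the `clusterresourceoverride-token-*` secrets.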
      
      
      
      

       

      Version-Release number of selected component (if applicable):

       

      How reproducible:

       

      Steps to Reproduce:

      1.
      2.
      3.
      

      Actual results:

      A large number of secrets is created in etcd.

      Expected results:

      The operator stops issuing requests that create new secrets.

      Additional info:

      In this case, the customer disabled the operator.

              joelsmith.redhat Joel Smith
              openshift-crt-jira-prow OpenShift Prow Bot
              Aditi Sahay Aditi Sahay