
ClusterResourceOverride Operator creating high quantity of clusterresourceoverride-token secrets (57k) in the OCP cluster

    • Important
      * Previously, when the Cluster Resource Override Operator failed to run its operand controller, the Operator attempted to re-run the controller. Each re-run operation generated a new set of secrets that eventually constrained cluster namespace resources. With this release, the service account for a cluster now includes annotations that prevent the Operator from creating additional secrets when a secret already exists for the cluster. (link:https://issues.redhat.com/browse/OCPBUGS-44351[*OCPBUGS-44351*])
    • Bug Fix
    • Done

      This is a clone of issue OCPBUGS-42718. The following is the description of the original issue:

      This is a clone of issue OCPBUGS-15462. The following is the description of the original issue:

      Description of problem:

      The ClusterResourceOverride Operator is constantly creating a large number of secrets in etcd, severely degrading usability of the cluster. The memory spike caused by these requests drove roughly 60 containers on the master nodes into OOM-killed exits.
      
      After the upgrade from 4.10 to 4.12, we noticed that the etcd data for the cluster was growing rapidly: it increased from 500 MB to 2.5 GB in two days.
      
      When we inspected the etcd objects, we found that the ClusterResourceOverride Operator was creating a huge number of secrets with no identifiable cause:
      
      
      Old report of objects in the cluster: 
      
      [NUMBER OF OBJECTS IN ETCD] 
      273139 events
        38353 secrets <<<<<<<
      
      When we listed the secrets with the oc command line, we saw a count of about 1.5k secrets.
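
      The gap between the etcd report and the oc listing can be tracked with a quick count of the leaked token secrets. A minimal sketch, assuming the secrets follow the clusterresourceoverride-token- naming seen in the API server log lines later in this report (the helper name is ours):

```shell
# Count leaked service-account token secrets in a secret name listing
# (sketch; the clusterresourceoverride-token- prefix is taken from the
# kube-apiserver log lines quoted in this report).
count_token_secrets() {
  # Reads one secret name per line on stdin; prints how many match.
  grep -c '^clusterresourceoverride-token-' || true
}

# On a live cluster this would be fed from something like:
#   oc get secrets -n openshift-clusterresourceoverride \
#     -o custom-columns=NAME:.metadata.name --no-headers | count_token_secrets
```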
      
      
      We can observe on the master nodes that the high utilization comes from the API server:
      
      
      Top MEM-using processes: 
          USER      PID    %CPU  %MEM  VSZ-MiB  RSS-MiB  TTY    STAT   START  TIME  COMMAND  
          root      52188  133   38.9  26447    18792    ?      -      10:36  3:05  kube-apiserver --openshift-config=/etc/kubernetes/static-pod-resources/configmaps/config/config.yaml --
      
      
      We can also observe these messages in the kube-apiserver logs:
      
      /kube-apiserver/kube-apiserver/logs/previous.log:2023-06-22T07:11:17.036606087Z I0622 07:11:17.036542      18 trace.go:205] Trace[1639443765]: "Get" url:/api/v1/namespaces/openshift-clusterresourceoverride/secrets/clusterresourceoverride-token-xxx,user-agent:kube-controller-manager/v1.25.8+37a9a08 (linux/amd64) kubernetes/df8a1b9/system:serviceaccount:kube-system:generic-garbage-collector,audit-id:aaef554b-06dd-4801-94e8-601b5ce75a10,client:10.127.206.24,accept:application/vnd.kubernetes.protobuf;as=PartialObjectMetadata;g=meta.k8s.io;v=v1,application/json;as=PartialObjectMetadata;g=meta.k8s.io;v=v1,application/json,protocol:HTTP/2.0 (22-Jun-2023 07:10:21.424) (total time: 55611ms):
      
      kube-apiserver/kube-apiserver/logs/previous.log:2023-06-22T07:11:17.037224152Z E0622 07:11:17.036738      18 timeout.go:141] post-timeout activity - time-elapsed: 1.417764462s, GET "/api/v1/namespaces/openshift-clusterresourceoverride/secrets/clusterresourceoverride-token-nmzvt" result: <nil>
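
      The trace lines above can be mined for the slow token-secret GETs and their latencies. A sketch that relies on the `"Get" url:...,... (total time: NNNms)` shape of those lines (the function name is ours):

```shell
# Extract the request URL and total latency from kube-apiserver trace
# lines like the ones above (sketch; assumes the
# `url:...,` and `(total time: NNNms)` fields shown in this report).
slow_secret_gets() {
  grep '/secrets/clusterresourceoverride-token-' \
    | sed -n 's/.*url:\([^,]*\),.*(total time: \([0-9]*ms\)).*/\1 \2/p'
}
```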
      
      
      
      

       

      Version-Release number of selected component (if applicable):

       

      How reproducible:

       

      Steps to Reproduce:

      1.
      2.
      3.
      

      Actual results:

      A high quantity of secrets in etcd.

      Expected results:

      The Operator stops issuing requests that create additional secrets.

      Additional info:

      In this case, the customer disabled the operator.
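
      For clusters that have already accumulated tens of thousands of these secrets, deleting them one API call at a time would add yet more load. A hedged cleanup sketch (the helper, batch size, and dry-run echo are ours, not from the ticket):

```shell
# Delete leaked token secrets in batches rather than one call per
# secret (sketch; prints the commands instead of running them; drop
# the leading `echo` to actually delete).
delete_in_batches() {
  # Reads secret names on stdin; one oc invocation per 50 names.
  xargs -n 50 echo oc delete secret -n openshift-clusterresourceoverride
}

# Live-cluster usage would look like:
#   oc get secrets -n openshift-clusterresourceoverride -o name \
#     | grep 'clusterresourceoverride-token-' | sed 's|^secret/||' \
#     | delete_in_batches
```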


            Errata Tool added a comment:

            Since the problem described in this issue should be resolved in a recent advisory, it has been closed.

            For information on the advisory (Moderate: OpenShift Container Platform 4.16.23 bug fix and security update), and where to find the updated files, follow the link below.

            If the solution does not work for you, open a new bug report.
            https://access.redhat.com/errata/RHSA-2024:9615


            Aditi Sahay added a comment (edited):

            I have verified this on 4.16 CRO. Here are the logs:

            $ oc get clusterversion
            NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
            version   4.16.0-0.nightly-2024-11-08-100216   True        False         112m    Cluster version is 4.16.0-0.nightly-2024-11-08-100216
            $ oc get csv
            NAME                                                    DISPLAY                            VERSION               REPLACES   PHASE
            clusterresourceoverride-operator.v4.16.0-202411081636   ClusterResourceOverride Operator   4.16.0-202411081636              Succeeded
            
            
            

            Before killing the root PID:

            $ oc get pods -n clusterresourceoverride-operator 
            NAME                                                READY   STATUS    RESTARTS   AGE
            clusterresourceoverride-6cwqf                       1/1     Running   0          3m46s
            clusterresourceoverride-dwh2c                       1/1     Running   0          3m46s
            clusterresourceoverride-hpq4w                       1/1     Running   0          3m46s
            clusterresourceoverride-operator-7d99d5c447-jkzwd   1/1     Running   0          4m28s
            
            

            After running: while true; do echo "-- killing pods--"; for i in `oc get pods --no-headers | cut -d' ' -f1 | grep clusterresourceoverride | grep -vi operator`; do oc exec $i -- /bin/sh -c "kill 1" ; done ; done

            $ oc get pods -n clusterresourceoverride-operator
            NAME                                                READY   STATUS             RESTARTS      AGE
            clusterresourceoverride-6cwqf                       0/1     CrashLoopBackOff   2 (26s ago)   5m40s
            clusterresourceoverride-dwh2c                       0/1     CrashLoopBackOff   2 (22s ago)   5m40s
            clusterresourceoverride-hpq4w                       0/1     CrashLoopBackOff   2 (18s ago)   5m40s
            clusterresourceoverride-operator-7d99d5c447-jkzwd   1/1     Running            0             6m22s
            $ oc logs -f clusterresourceoverride-operator-7d99d5c447-jkzwd
            W1111 08:55:09.198480       1 client_config.go:618] Neither --kubeconfig nor --master was specified.  Using the inClusterConfig.  This might not work.
            I1111 08:55:09.198837       1 start.go:100] [operator] configuration - name=clusterresourceoverride namespace=clusterresourceoverride-operator operand-image=registry.redhat.io/openshift4/ose-clusterresourceoverride-rhel9@sha256:662cfd046e7e00bbee1db4bfa6f850dff375fa4680e3427e6bdfd0d77f9c7c5a operand-version=1.0.0
            I1111 08:55:09.198861       1 start.go:101] [operator] starting
            I1111 08:55:09.213624       1 reflector.go:351] Caches populated for *v1.Service from k8s.io/client-go@v0.29.1/tools/cache/reflector.go:229
            I1111 08:55:09.214435       1 reflector.go:351] Caches populated for *v1.MutatingWebhookConfiguration from k8s.io/client-go@v0.29.1/tools/cache/reflector.go:229
            I1111 08:55:09.216456       1 reflector.go:351] Caches populated for *v1.Pod from k8s.io/client-go@v0.29.1/tools/cache/reflector.go:229
            I1111 08:55:09.216496       1 reflector.go:351] Caches populated for *v1.Secret from k8s.io/client-go@v0.29.1/tools/cache/reflector.go:229
            I1111 08:55:09.216840       1 reflector.go:351] Caches populated for *v1.DaemonSet from k8s.io/client-go@v0.29.1/tools/cache/reflector.go:229
            I1111 08:55:09.217275       1 reflector.go:351] Caches populated for *v1.Deployment from k8s.io/client-go@v0.29.1/tools/cache/reflector.go:229
            I1111 08:55:09.217336       1 reflector.go:351] Caches populated for *v1.ServiceAccount from k8s.io/client-go@v0.29.1/tools/cache/reflector.go:229
            I1111 08:55:09.218118       1 reflector.go:351] Caches populated for *v1.ConfigMap from k8s.io/client-go@v0.29.1/tools/cache/reflector.go:229
            I1111 08:55:09.300037       1 runner.go:39] [controller] name=clusterresourceoverride starting informer
            I1111 08:55:09.300150       1 runner.go:42] [controller] name=clusterresourceoverride waiting for informer cache to sync
            I1111 08:55:09.315858       1 reflector.go:351] Caches populated for *v1.ClusterResourceOverride from k8s.io/client-go@v0.29.1/tools/cache/reflector.go:229
            I1111 08:55:09.400850       1 runner.go:52] [controller] name=clusterresourceoverride started 1 worker(s)
            I1111 08:55:09.400969       1 runner.go:54] [controller] name=clusterresourceoverride waiting 
            I1111 08:55:09.400883       1 worker.go:22] [controller] name=clusterresourceoverride starting to process work item(s)
            I1111 08:55:09.400994       1 run.go:98] operator is waiting for controller(s) to be done
            I1111 08:55:09.401002       1 start.go:110] [operator] operator is running, waiting for the operator to be done.
            I1111 08:55:43.228004       1 configuration.go:60] key=cluster resource=*v1.ConfigMap/clusterresourceoverride-configuration successfully created
            I1111 08:55:43.228027       1 configuration.go:75] key=cluster resource=*v1.ConfigMap/clusterresourceoverride-configuration configuration has drifted
            I1111 08:55:43.247064       1 configuration.go:95] key=cluster resource=*v1.ConfigMap/clusterresourceoverride-configuration resource-version=62006 setting object reference
            I1111 08:55:43.270600       1 cert_generation.go:121] key=cluster resource=*v1.Secret/server-serving-cert-clusterresourceoverride successfully ensured
            I1111 08:55:43.270619       1 cert_generation.go:124] key=cluster resource=*v1.ConfigMap/clusterresourceoverride-service-serving successfully ensured
            I1111 08:55:43.270630       1 cert_generation.go:134] key=cluster resource=*v1.Secret/server-serving-cert-clusterresourceoverride resource-version=62007 setting object reference
            I1111 08:55:43.270634       1 cert_generation.go:137] key=cluster resource=*v1.Secret/server-serving-cert-clusterresourceoverride is original sync
            I1111 08:55:43.270643       1 cert_generation.go:146] key=cluster resource=*v1.ConfigMap/clusterresourceoverride-service-serving resource-version=62008 setting object reference
            I1111 08:55:43.270648       1 cert_generation.go:149] key=cluster resource=*v1.ConfigMap/clusterresourceoverride-service-serving is original sync
            I1111 08:55:43.270667       1 cert_ready.go:63] key=cluster cert check passed
            I1111 08:55:43.281081       1 deploy.go:159] key=cluster ensured RBAC resource clusterresourceoverride
            I1111 08:55:43.380579       1 deploy.go:159] key=cluster ensured RBAC resource extension-server-authentication-reader-clusterresourceoverride
            I1111 08:55:43.454986       1 deploy.go:159] key=cluster ensured RBAC resource system:clusterresourceoverride-requester
            I1111 08:55:43.534026       1 deploy.go:159] key=cluster ensured RBAC resource default-aggregated-apiserver-clusterresourceoverride
            I1111 08:55:43.625343       1 deploy.go:159] key=cluster ensured RBAC resource default-aggregated-apiserver-clusterresourceoverride
            I1111 08:55:43.720137       1 deploy.go:159] key=cluster ensured RBAC resource auth-delegator-clusterresourceoverride
            I1111 08:55:43.912606       1 deploy.go:159] key=cluster ensured RBAC resource clusterresourceoverride-scc-hostnetwork-use
            I1111 08:55:44.111179       1 deploy.go:159] key=cluster ensured RBAC resource clusterresourceoverride-scc-hostnetwork-use
            I1111 08:55:44.284298       1 deploy.go:159] key=cluster ensured RBAC resource clusterresourceoverride-anonymous-access
            I1111 08:55:44.484893       1 deploy.go:159] key=cluster ensured RBAC resource clusterresourceoverride-anonymous-access
            I1111 08:55:44.620839       1 deploy.go:83] key=cluster resource=*v1.DaemonSet/clusterresourceoverride successfully ensured
            I1111 08:55:44.620936       1 deploy.go:97] key=cluster resource=*v1.DaemonSet/clusterresourceoverride resource-version=62033 setting object reference
            I1111 08:55:44.620977       1 deployment_ready.go:37] key=cluster resource=clusterresourceoverride deployment is not ready
            E1111 08:55:44.631382       1 worker.go:67] error syncing '/cluster': waiting for daemonset spec update name=clusterresourceoverride, requeuing
            I1111 08:55:44.631517       1 configuration.go:70] key=cluster resource=*v1.ConfigMap/clusterresourceoverride-configuration is in sync
            I1111 08:55:44.631532       1 cert_generation.go:137] key=cluster resource=*v1.Secret/server-serving-cert-clusterresourceoverride is original sync
            I1111 08:55:44.631538       1 cert_generation.go:149] key=cluster resource=*v1.ConfigMap/clusterresourceoverride-service-serving is original sync
            I1111 08:55:44.631551       1 cert_ready.go:63] key=cluster cert check passed
            
            

            No new secrets have been created:

            $ oc get secrets
            NAME                                               TYPE                      DATA   AGE
            builder-dockercfg-jgpkd                            kubernetes.io/dockercfg   1      9m4s
            clusterresourceoverride-dockercfg-x6nhz            kubernetes.io/dockercfg   1      7m53s
            clusterresourceoverride-operator-dockercfg-vh54f   kubernetes.io/dockercfg   1      8m34s
            default-dockercfg-7wlkn                            kubernetes.io/dockercfg   1      9m4s
            deployer-dockercfg-sg844                           kubernetes.io/dockercfg   1      9m4s
            server-serving-cert-clusterresourceoverride        kubernetes.io/tls         2      7m53s
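
            A before/after comparison of saved secret listings makes this "no new secrets" check mechanical. A sketch (the helper name is ours; feed it snapshots of the listing taken before and after the crash-loop test):

```shell
# Report how many clusterresourceoverride token secrets were added
# between two saved `oc get secrets` snapshots (sketch).
token_secret_delta() {
  # $1 = listing taken before the test, $2 = listing taken after.
  before=$(grep -c 'clusterresourceoverride-token-' "$1" || true)
  after=$(grep -c 'clusterresourceoverride-token-' "$2" || true)
  echo $((after - before))   # 0 means no new token secrets leaked
}
```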
            
            


              joelsmith.redhat Joel Smith
              openshift-crt-jira-prow OpenShift Prow Bot
              Aditi Sahay Aditi Sahay