OpenShift Bugs / OCPBUGS-52256

Hypershift issues at scale


    • Type: Bug
    • Resolution: Unresolved
    • Priority: Major
    • None
    • 4.15, 4.16, 4.17, 4.18
    • HyperShift
    • Important
    • Yes
    • False

      Description of problem:

      After updating HyperShift from v4.15.6 to v4.15.8, memory and CPU usage increase rapidly until the pod eventually crashes. The issue is driven by the sheer number of clusters being managed and how quickly they are reconciled. At its core is the repeated lookup of release images during reconciliation by the hosted cluster, control plane, and various other controllers. Very soon after reconciliation starts, the image lookup requests begin to fail due to rate limiting (toomanyrequests errors). Memory also grows with the number of errors and loggers being created: go.uber.org/zap/internal/bufferpool.init.NewPool.func1 eventually comes to dominate the in-use heap space.
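To illustrate why repeated lookups matter at this scale, here is a minimal sketch of a TTL cache in front of a release image lookup. It is not HyperShift's actual code: `LookupFunc`, `ReleaseCache`, and the 10-minute TTL are hypothetical names and values chosen for illustration, and the lookup is simulated rather than sent to a registry.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// LookupFunc is a hypothetical stand-in for the call that fetches release
// metadata from the registry; it is not HyperShift's actual API.
type LookupFunc func(image string) (string, error)

type cacheEntry struct {
	metadata string
	expires  time.Time
}

// ReleaseCache memoizes successful lookups per image reference for a fixed TTL.
type ReleaseCache struct {
	mu      sync.Mutex
	ttl     time.Duration
	lookup  LookupFunc
	entries map[string]cacheEntry
}

func NewReleaseCache(ttl time.Duration, lookup LookupFunc) *ReleaseCache {
	return &ReleaseCache{ttl: ttl, lookup: lookup, entries: make(map[string]cacheEntry)}
}

// Get returns cached metadata while it is still fresh and otherwise performs
// one lookup. The mutex is held during the lookup, so concurrent misses for
// the same image produce a single request, at the cost of serializing
// lookups for different images.
func (c *ReleaseCache) Get(image string) (string, error) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if e, ok := c.entries[image]; ok && time.Now().Before(e.expires) {
		return e.metadata, nil
	}
	metadata, err := c.lookup(image)
	if err != nil {
		return "", err // failures are not cached, so the next call retries
	}
	c.entries[image] = cacheEntry{metadata: metadata, expires: time.Now().Add(c.ttl)}
	return metadata, nil
}

func main() {
	registryCalls := 0
	cache := NewReleaseCache(10*time.Minute, func(image string) (string, error) {
		registryCalls++ // simulated registry request
		return "metadata for " + image, nil
	})
	// Simulate 375 hosted clusters that all reference the same release image.
	for i := 0; i < 375; i++ {
		if _, err := cache.Get("us.icr.io/armada-master/ocp-release:4.15.17-x86_64"); err != nil {
			fmt.Println("lookup failed:", err)
		}
	}
	fmt.Println("registry calls:", registryCalls)
}
```

In this simulation the 375 lookups collapse into one request, because every cluster references the same image. A real cache would also need to bound its size and decide how to treat mutable tags; both are left out here.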

      Version-Release number of selected component (if applicable):

      v4.15.8    

      How reproducible:

      Create an environment where HyperShift is managing 375 clusters.

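      Steps to Reproduce:

          1. Run HyperShift v4.15.6 in an environment where it manages 375 clusters.
          2. Update HyperShift to v4.15.8.
          3. Watch the pod's memory and CPU usage as reconciliation starts.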
          

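      Actual results:

          Memory and CPU usage increase rapidly, release image lookups fail with toomanyrequests errors, and the pod eventually crashes.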

      Expected results:

          

      Additional info:

      Hypershift changes between versions v4.15.6 and v4.15.8: https://github.com/openshift/hypershift/compare/a122f2...c1efc7    
      
      Example toomanyrequests error during HostedCluster reconcile:
      {"level":"error","ts":"2024-07-03T14:21:13Z","msg":"Reconciler error","controller":"hostedcluster","controllerGroup":"hypershift.openshift.io","controllerKind":"HostedCluster","HostedCluster":{"name":"cnu358v20k8t03do3cmg","namespace":"master"},"namespace":"master","name":"cnu358v20k8t03do3cmg","reconcileID":"aa0234d2-801e-4a8b-a16e-eda34475ba31","error":"failed to lookup release image: failed to extract release metadata: failed to obtain root manifest for us.icr.io/armada-master/ocp-release:4.15.17-x86_64: toomanyrequests","stacktrace":"sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/hypershift/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:329\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/hypershift/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:266\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/hypershift/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:227"}
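One way to keep errors like this from retrying immediately is to detect the rate-limit failure and requeue with a growing delay. The sketch below is an assumption-laden illustration, not HyperShift's behavior: `IsRateLimited` and `RequeueDelay` are hypothetical helpers, and matching on the "toomanyrequests" substring is based only on the error text in the log above.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
	"time"
)

// errExample reproduces the tail of the error message from the log above.
var errExample = errors.New("failed to obtain root manifest for us.icr.io/armada-master/ocp-release:4.15.17-x86_64: toomanyrequests")

// IsRateLimited reports whether err looks like a registry rate-limit failure.
// Substring matching is an assumption; a typed error would be preferable if
// the underlying client exposes one.
func IsRateLimited(err error) bool {
	return err != nil && strings.Contains(err.Error(), "toomanyrequests")
}

// RequeueDelay returns base doubled once per attempt, capped at max.
// attempt starts at 0, so the first retry waits for base.
func RequeueDelay(attempt int, base, max time.Duration) time.Duration {
	delay := base
	for i := 0; i < attempt && delay < max; i++ {
		delay *= 2
	}
	if delay > max {
		delay = max
	}
	return delay
}

func main() {
	if IsRateLimited(errExample) {
		for attempt := 0; attempt <= 5; attempt++ {
			fmt.Printf("attempt %d: requeue after %s\n", attempt, RequeueDelay(attempt, time.Second, 30*time.Second))
		}
	}
}
```

In a controller-runtime reconciler, a delay computed this way could be returned as `ctrl.Result{RequeueAfter: delay}` rather than returning the error. Adding random jitter would help keep hundreds of clusters from retrying in lockstep.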
      
      A complete fix will likely include:
      1. Image caches for release image lookups
      2. A configurable setting that pauses a controller before it processes the next work item, or some other way of reducing the rate at which items are processed
      3. An exposed setting for each controller's maximum number of concurrent reconciles
      4. A global logger instead of a new logger created on every reconciliation
      5. No longer using development mode ("zap.UseDevMode(true)") for new loggers
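Items 2 and 3 in the list above could look roughly like the following sketch, which uses only the standard library. Everything here is hypothetical: the `Options` type, the `Run` function, and the environment variable names `RECONCILE_MAX_CONCURRENT` and `RECONCILE_PAUSE` are invented for illustration and are not existing HyperShift settings.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
	"sync"
	"time"
)

// Options holds the two tuning knobs. Field and variable names are hypothetical.
type Options struct {
	MaxConcurrentReconciles int           // number of parallel workers
	PauseBetweenItems       time.Duration // sleep after each processed item
}

// OptionsFromEnv reads the knobs through getenv and falls back to one worker
// and no pause when a value is missing or invalid.
func OptionsFromEnv(getenv func(string) string) Options {
	opts := Options{MaxConcurrentReconciles: 1}
	if n, err := strconv.Atoi(getenv("RECONCILE_MAX_CONCURRENT")); err == nil && n > 0 {
		opts.MaxConcurrentReconciles = n
	}
	if d, err := time.ParseDuration(getenv("RECONCILE_PAUSE")); err == nil && d >= 0 {
		opts.PauseBetweenItems = d
	}
	return opts
}

// Run processes items with at most MaxConcurrentReconciles workers. Each
// worker sleeps for PauseBetweenItems after every item it handles.
func Run(items []string, opts Options, reconcile func(string)) {
	if opts.MaxConcurrentReconciles < 1 {
		opts.MaxConcurrentReconciles = 1 // avoid a deadlock with zero workers
	}
	queue := make(chan string)
	var wg sync.WaitGroup
	for w := 0; w < opts.MaxConcurrentReconciles; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for item := range queue {
				reconcile(item)
				time.Sleep(opts.PauseBetweenItems)
			}
		}()
	}
	for _, item := range items {
		queue <- item
	}
	close(queue)
	wg.Wait()
}

func main() {
	opts := OptionsFromEnv(os.Getenv)
	var mu sync.Mutex
	processed := 0
	Run([]string{"cluster-a", "cluster-b", "cluster-c"}, opts, func(name string) {
		mu.Lock()
		processed++
		mu.Unlock()
	})
	fmt.Printf("processed %d items with %d worker(s)\n", processed, opts.MaxConcurrentReconciles)
}
```

With controller-runtime, the concurrency half of this already exists as the `MaxConcurrentReconciles` field of `controller.Options`; the open question in the list above is whether operators can set it. The pause is a blunt throttle, and a work queue rate limiter might be a better fit in practice.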

              Assignee: Unassigned
              Reporter: Ryan Cradick (rcradick)