-
Bug
-
Resolution: Unresolved
-
Major
-
None
-
4.15, 4.16, 4.17, 4.18
Description of problem:
After updating Hypershift from v4.15.6 to v4.15.8, memory and CPU usage increase rapidly until the pod eventually crashes. The issue is driven by the large number of clusters being managed and how quickly they are reconciled. At its core is the repeated lookup of release images during reconciliation by the hosted cluster, control plane, and various other controllers. Very soon after reconciliation starts, the image lookup requests begin to fail due to rate limiting (toomanyrequests errors). Memory also grows because of the number of errors and loggers being created; specifically, go.uber.org/zap/internal/bufferpool.init.NewPool.func1 eventually comes to dominate the heap in-use space.
Version-Release number of selected component (if applicable):
v4.15.8
How reproducible:
Create an environment where hypershift is managing 375 clusters.
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Hypershift changes between versions v4.15.6 and v4.15.8: https://github.com/openshift/hypershift/compare/a122f2...c1efc7
Example toomanyrequests error during HostedCluster reconcile:
{"level":"error","ts":"2024-07-03T14:21:13Z","msg":"Reconciler error","controller":"hostedcluster","controllerGroup":"hypershift.openshift.io","controllerKind":"HostedCluster","HostedCluster":{"name":"cnu358v20k8t03do3cmg","namespace":"master"},"namespace":"master","name":"cnu358v20k8t03do3cmg","reconcileID":"aa0234d2-801e-4a8b-a16e-eda34475ba31","error":"failed to lookup release image: failed to extract release metadata: failed to obtain root manifest for us.icr.io/armada-master/ocp-release:4.15.17-x86_64: toomanyrequests","stacktrace":"sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/hypershift/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:329\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/hypershift/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:266\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/hypershift/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:227"}
A complete fix will likely include:
1. Image caches for release image lookups
2. A configurable control variable for pausing controllers before processing the next work item, or some other way of reducing the rate at which items are processed
3. An exposed control variable for the controller's max concurrent reconciles
4. A global logger instead of creating a new logger on every reconciliation
5. No longer using Dev mode ("zap.UseDevMode(true)") for new loggers
- account is impacted by
OCPBUGS-52821 Stop using DevMode for loggers (Closed)