Type: Bug
Resolution: Unresolved
Priority: Major
Fix Version/s: None
Affects Version/s: 4.15, 4.16, 4.17, 4.18
Component/s: HyperShift
Labels:
- triaged

Severity: Important
Regression: Yes
Blocked: False
Blocked Reason: None
Target Version: 4.18.z
Target Backport Versions: 4.18.z

SFDC Cases Counter:
SFDC Cases Open:
SFDC Cases Links:

Description of problem:

After updating HyperShift from v4.15.6 to v4.15.8, memory and CPU usage increase rapidly until the pod eventually crashes. The trigger is the large number of clusters being managed and how quickly they are reconciled. At the core of the issue is the repeated lookup of release images during reconciliation by the hosted cluster, control plane, and various other controllers. Shortly after reconciliation starts, the image lookup requests begin to fail due to rate limiting (toomanyrequests errors). Memory also grows with the number of errors and loggers being created; specifically, memory allocated by go.uber.org/zap/internal/bufferpool.init.NewPool.func1 eventually comes to dominate the heap in-use space.
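To illustrate why repeated lookups of the same release image are avoidable, here is a minimal sketch of a TTL cache wrapped around a lookup function. It is not HyperShift code: `ReleaseMetadata`, `LookupFunc`, `CachingLookup`, and `registryCallsFor` are hypothetical names, the registry is faked, and the 10-minute TTL is an arbitrary assumption.

```go
package main

import (
	"context"
	"fmt"
	"sync"
	"time"
)

// ReleaseMetadata is a hypothetical stand-in for the result of a release image lookup.
type ReleaseMetadata struct {
	Version string
}

// LookupFunc is a hypothetical signature for the expensive registry call.
type LookupFunc func(ctx context.Context, image string) (ReleaseMetadata, error)

type cacheEntry struct {
	meta    ReleaseMetadata
	expires time.Time
}

// CachingLookup memoizes successful lookups per image reference for a fixed TTL.
type CachingLookup struct {
	mu      sync.Mutex
	entries map[string]cacheEntry
	ttl     time.Duration
	lookup  LookupFunc
}

func NewCachingLookup(lookup LookupFunc, ttl time.Duration) *CachingLookup {
	return &CachingLookup{entries: make(map[string]cacheEntry), ttl: ttl, lookup: lookup}
}

// Lookup returns a cached entry while it is fresh and otherwise calls the registry.
// The mutex is held across the upstream call, so concurrent misses for the same
// image collapse into one request (at the cost of serializing all lookups).
func (c *CachingLookup) Lookup(ctx context.Context, image string) (ReleaseMetadata, error) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if e, ok := c.entries[image]; ok && time.Now().Before(e.expires) {
		return e.meta, nil
	}
	meta, err := c.lookup(ctx, image)
	if err != nil {
		return ReleaseMetadata{}, err // errors are not cached in this sketch
	}
	c.entries[image] = cacheEntry{meta: meta, expires: time.Now().Add(c.ttl)}
	return meta, nil
}

// registryCallsFor simulates n reconciles that all need the same release image
// and returns how many times the fake registry was actually contacted.
func registryCallsFor(n int) int {
	calls := 0
	fakeRegistry := func(ctx context.Context, image string) (ReleaseMetadata, error) {
		calls++ // only invoked while CachingLookup holds its mutex
		return ReleaseMetadata{Version: "4.15.17"}, nil
	}
	cache := NewCachingLookup(fakeRegistry, 10*time.Minute)
	for i := 0; i < n; i++ {
		if _, err := cache.Lookup(context.Background(), "us.icr.io/armada-master/ocp-release:4.15.17-x86_64"); err != nil {
			panic(err)
		}
	}
	return calls
}

func main() {
	fmt.Println("registry calls for 375 reconciles:", registryCallsFor(375))
}
```

With 375 simulated reconciles of the same image, the fake registry is contacted once. Because failed lookups are not cached here, a registry that is already returning toomanyrequests would still be retried on every reconcile, so a cache alone would probably need to be paired with some form of backoff.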

Version-Release number of selected component (if applicable):

v4.15.8

How reproducible:

Create an environment where hypershift is managing 375 clusters.

Steps to Reproduce:

    1. 
    2.
    3.

Actual results:

Expected results:

Additional info:

Hypershift changes between versions v4.15.6 and v4.15.8: https://github.com/openshift/hypershift/compare/a122f2...c1efc7    

Example toomanyrequests error during HostedCluster reconcile:
{"level":"error","ts":"2024-07-03T14:21:13Z","msg":"Reconciler error","controller":"hostedcluster","controllerGroup":"hypershift.openshift.io","controllerKind":"HostedCluster","HostedCluster":{"name":"cnu358v20k8t03do3cmg","namespace":"master"},"namespace":"master","name":"cnu358v20k8t03do3cmg","reconcileID":"aa0234d2-801e-4a8b-a16e-eda34475ba31","error":"failed to lookup release image: failed to extract release metadata: failed to obtain root manifest for us.icr.io/armada-master/ocp-release:4.15.17-x86_64: toomanyrequests","stacktrace":"sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/hypershift/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:329\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/hypershift/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:266\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/hypershift/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:227"}

A complete fix will likely include:
1. Image caches for release image lookups
2. A configurable control variable for pausing controllers before they process the next work item, or some other way of reducing the rate at which items are processed
3. An exposed control variable for each controller's maximum number of concurrent reconciles
4. Global logger instead of creating a new logger on every reconciliation
5. Stop using Dev mode "zap.UseDevMode(true)" for new loggers
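
Items 2 and 3 can be sketched with the standard library alone. The example below is an illustration, not the HyperShift or controller-runtime implementation: `Throttle`, `MaxConcurrent`, `MinInterval`, and `peakConcurrency` are hypothetical names, and the work function is a short sleep standing in for a reconcile.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
	"time"
)

// Throttle holds two hypothetical knobs: how many work items may run at once
// and how long to pause before starting the next one.
type Throttle struct {
	MaxConcurrent int
	MinInterval   time.Duration
}

// Run processes every key with fn and returns the highest concurrency observed.
func (t Throttle) Run(keys []string, fn func(key string)) int {
	if t.MaxConcurrent < 1 {
		t.MaxConcurrent = 1 // avoid a zero-capacity semaphore that would never admit work
	}
	slots := make(chan struct{}, t.MaxConcurrent)
	var wg sync.WaitGroup
	var inFlight, peak int64

	for i, key := range keys {
		if i > 0 && t.MinInterval > 0 {
			time.Sleep(t.MinInterval) // pause before picking up the next work item
		}
		slots <- struct{}{} // blocks while MaxConcurrent items are running
		wg.Add(1)
		go func(key string) {
			defer wg.Done()
			defer func() { <-slots }() // free the slot once this item is done
			n := atomic.AddInt64(&inFlight, 1)
			for { // record the peak with a compare-and-swap loop
				p := atomic.LoadInt64(&peak)
				if n <= p || atomic.CompareAndSwapInt64(&peak, p, n) {
					break
				}
			}
			fn(key)
			atomic.AddInt64(&inFlight, -1)
		}(key)
	}
	wg.Wait()
	return int(atomic.LoadInt64(&peak))
}

// peakConcurrency runs a number of fake reconciles under the given limit
// and reports the peak number that ran simultaneously.
func peakConcurrency(maxConcurrent, items int) int {
	keys := make([]string, items)
	for i := range keys {
		keys[i] = fmt.Sprintf("cluster-%d", i)
	}
	t := Throttle{MaxConcurrent: maxConcurrent, MinInterval: 100 * time.Microsecond}
	return t.Run(keys, func(string) { time.Sleep(time.Millisecond) })
}

func main() {
	fmt.Println("peak concurrent reconciles (limit 3):", peakConcurrency(3, 20))
}
```

The reported peak never exceeds the configured limit. In a real controller, equivalent knobs would more likely be wired through the controller framework's own options than through a hand-rolled semaphore; this sketch only shows the effect the two variables are meant to have.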

account is impacted by: OCPBUGS-52821 Stop using DevMode for loggers (Closed)

Assignee: Unassigned
Reporter: Ryan Cradick
Votes: 0
Watchers: 4
Created: 2025/03/03 8:25 PM
Updated: 2025/03/10 3:20 PM