  1. OpenShift Request For Enhancement
  2. RFE-7713

Allow configuring resource requests and limits for Kueue's controller manager


    • Type: Feature Request
    • Resolution: Won't Do
    • Priority: Normal
    • AI/ML Workloads
    • Product / Portfolio Work

      1. Proposed title of this feature request

      Allow configuring resource requests and limits for Kueue's controller manager

      2. What is the nature and description of the request?

      Allow users to customize the resource requirements of Kueue's controller manager when deploying using the operator.

      3. Why does the customer need this? (List the business requirements here)

      Different customers will have different numbers of Workloads, ClusterQueues, LocalQueues, and so on. These resources put load on Kueue's controller manager, so the more of them exist, the more memory and CPU the controller manager needs. Without a way to raise its requests and limits, the controller manager pod may be OOM-killed or CPU-throttled.

       

      The current requests and limits are:

              resources:
                limits:
                  cpu: "2"
                  memory: 512Mi
                requests:
                  cpu: 500m
                  memory: 512Mi
      

       

       

      In one of our clusters (Konflux), the controller crashed because it ran out of memory (OOM). That cluster had ~500 LocalQueues and ~30 Workloads; other clusters can have more.

       

      I'm attaching a screenshot of the memory usage, taken after I manually patched the deployment and scaled the operator to 0 replicas so that it would not override my change.
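
      For illustration, a manual patch of this kind could look roughly like the fragment below. This is only a sketch: the container name `manager` and the 2Gi memory value are assumptions made for the example, not values taken from the affected cluster. The CPU values are the current defaults listed above. A change like this only persists while the operator is scaled to 0.

      ```yaml
      # Hypothetical strategic merge patch for the Kueue controller manager Deployment.
      # The container name and the memory value are illustrative assumptions.
      spec:
        template:
          spec:
            containers:
              - name: manager          # assumed container name; containers are merged by name
                resources:
                  limits:
                    cpu: "2"           # unchanged from the current default
                    memory: 2Gi        # illustrative value, raised from 512Mi
                  requests:
                    cpu: 500m          # unchanged from the current default
                    memory: 2Gi        # kept equal to the limit, as in the current defaults
      ```

      Being able to set the same values through the operator is what this request asks for, so that the operator does not have to be scaled down.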

      As a side note, upstream Kueue allows configuring these resources when it is installed with the Helm chart.
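
      As a rough sketch of the upstream approach, a Helm values override might look like the fragment below. The key path `controllerManager.manager.resources` is an assumption about the chart's layout and should be checked against the chart's own values.yaml. The memory value is illustrative only.

      ```yaml
      # Hypothetical values override for the upstream Kueue Helm chart.
      # The key path is assumed; verify it against the chart's values.yaml.
      controllerManager:
        manager:
          resources:
            limits:
              cpu: "2"
              memory: 2Gi      # illustrative value, not a recommendation
            requests:
              cpu: 500m
              memory: 2Gi
      ```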

      4. List any affected packages or components.

       

      Kueue controller manager (AI/ML Workloads)

              julim Ju Lim
              gbenhaim Gal Ben Haim