OpenShift Request For Enhancement
RFE-7875

OCP Kueue: Support for changing Job resource limits and requests


    • Type: Feature Request
    • Resolution: Unresolved
    • Priority: Normal
    • Components: AI/ML Workloads, Node
    • Product / Portfolio Work

      1. Proposed title of this feature request

      Update Resources of Suspended Job in Kueue

      2. What is the nature and description of the request?

      Currently, jobs submitted to Kueue must declare fixed resource limits and requests that cannot be changed after submission. This RFE proposes updating Kueue so that the resource limits and requests of certain job types (PyTorchJob, TrainJob, Train Runtime) can be changed while the job is in the suspended state. The updates would cover CPU and memory as well as devices, specifically GPU requests.

      Our proposal has two parts: first, a new interface for GenericJob objects, similar to JobWithSkip{}, with a SkipWhileUpdating() function that returns true while the resource limits and requests are being changed; and second, a plugin interface that invokes the function that actually patches the job objects with the updated resources.
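
      Below is a minimal sketch, in Go, of what these two pieces could look like. It assumes Kueue's jobframework package and its existing GenericJob interface; the names JobWithSkipWhileUpdating and ResourceUpdater and their signatures are illustrative placeholders, not a finalized API.

      package jobframework

      import (
          "context"

          corev1 "k8s.io/api/core/v1"
      )

      // JobWithSkipWhileUpdating would be an optional interface, similar to
      // JobWithSkip, implemented by GenericJob integrations whose resources
      // may be patched while the job is suspended.
      type JobWithSkipWhileUpdating interface {
          GenericJob
          // SkipWhileUpdating returns true while the resource limits and
          // requests are being changed, so reconciliation can hold off on
          // the workload until the patch is complete.
          SkipWhileUpdating() bool
      }

      // ResourceUpdater is a hypothetical plugin interface whose implementation
      // performs the actual patch of the suspended job's resources.
      type ResourceUpdater interface {
          // UpdateResources patches the job object with new per-pod-set
          // resource requirements (CPU, memory, and devices such as GPUs).
          UpdateResources(ctx context.Context, job GenericJob, resources map[string]corev1.ResourceRequirements) error
      }

      With this split, the reconciler only needs to know when to hold off on admitting a workload, while the knowledge of how to patch a specific job type stays in the per-integration plugin.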

      This interface can be implemented for PyTorchJob as a demonstrator.
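
      As an illustration of that demonstrator, the sketch below shows how a PyTorchJob wrapper could satisfy SkipWhileUpdating(). The JobControl name and the annotation used to signal an in-flight resource update are assumptions for illustration only.

      package pytorchjob

      import (
          kftraining "github.com/kubeflow/training-operator/pkg/apis/kubeflow.org/v1"
      )

      // JobControl wraps the PyTorchJob object for Kueue's job framework.
      type JobControl kftraining.PyTorchJob

      // SkipWhileUpdating reports whether the suspended PyTorchJob's resources
      // are currently being patched. The signal here is a hypothetical
      // annotation that the resource-update plugin would set for the duration
      // of the patch and remove once the new resources are in place.
      func (j *JobControl) SkipWhileUpdating() bool {
          _, updating := j.Annotations["kueue.x-k8s.io/resources-updating"]
          suspended := j.Spec.RunPolicy.Suspend != nil && *j.Spec.RunPolicy.Suspend
          return updating && suspended
      }

      The corresponding ResourceUpdater implementation could then patch the container resources under spec.pytorchReplicaSpecs while the job remains suspended, and clear the annotation once the update is applied.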

      3. Why does the customer need this? (List the business requirements here)

      Fine-tuning large language models (LLMs) is a GPU-intensive process whose resource needs vary significantly depending on the model architecture, dataset size, sequence length, and token count. However, current job specifications in Kueue are static: once a job is submitted, its resource requests cannot be changed. This rigidity leads to inefficiencies: jobs may remain queued indefinitely waiting for their exact GPU request to be fulfilled, even if they could proceed with fewer resources. Moreover, users often misestimate their resource needs due to the complexity of LLM workloads, and even accurate estimates fail to account for real-time cluster conditions or job priorities. Without elasticity, this results in resource fragmentation, longer wait times, and underutilized hardware. Enabling Kueue to dynamically adjust job resource requirements would improve scheduling flexibility, reduce idle time, and make AI workloads more responsive to the realities of shared infrastructure.

      While we’re starting with PyTorchJob, there’s a clear need to eventually support TrainJob. With PyTorchJob being deprecated in Kubeflow Training Operator 2.0, TrainJob is the future-facing API. Supporting elasticity there ensures long-term compatibility.

      4. List any affected packages or components.

      OCP Kueue operator

              rhn-support-dhardie Duncan Hardie
              srikumarv Srikumar Venugopal
              Votes: 0
              Watchers: 4