Feature
Resolution: Unresolved
Normal
Product / Portfolio Work
Tech Preview
Feature Overview (aka. Goal Summary)
An elevator pitch (value statement) that describes the Feature in a clear, concise way. Complete during New status.
----- copied from RFE-7875 -----
1. Proposed title of this feature request
Update Resources of Suspended Job in Kueue
2. What is the nature and description of the request?
Currently, jobs submitted to Kueue must carry fixed resource limits and requests that cannot be changed. This change proposes updating Kueue to allow changing the resource limits and requests for certain jobs (PyTorchJob, TrainJob, Train Runtime) while the jobs are in a suspended state. The updates would cover not only CPU and memory but also devices, specifically GPU requests.
Our proposal has two parts: first, a new interface for GenericJob objects, similar to JobWithSkip{}, with a SkipWhileUpdating() function that returns true while the resource limits and requests are being changed; and second, a plugin interface that invokes the function which actually patches the job objects with the updated resources (see the sketches after this copied section).
This interface can be implemented for PyTorchJob as a demonstrator.
3. Why does the customer need this? (List the business requirements here)
Fine-tuning large language models (LLMs) is a GPU-intensive process whose resource needs vary significantly with model architecture, dataset size, sequence length, and token count. Current job specifications in Kueue, however, are static: once a job is submitted, its resource requests cannot be changed. This rigidity leads to inefficiencies: jobs may remain queued indefinitely waiting for their exact GPU request to be fulfilled, even if they could proceed with fewer resources. Moreover, users often misestimate their resource needs due to the complexity of LLM workloads, and even accurate estimates fail to account for real-time cluster conditions or job priorities. Without elasticity, the result is resource fragmentation, longer wait times, and underutilized hardware. Enabling Kueue to dynamically adjust job resource requirements would improve scheduling flexibility, reduce idle time, and make AI workloads more responsive to the realities of shared infrastructure.
While we’re starting with PyTorchJob, there’s a clear need to eventually support TrainJob. With PyTorchJob being deprecated in Kubeflow Training Operator 2.0, TrainJob is the future-facing API. Supporting elasticity there ensures long-term compatibility.
4. List any affected packages or components.
OCP Kueue operator
------------------------------------------------------
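To make the requested capability concrete, below is a minimal client-side sketch of what updating the GPU request of a suspended PyTorchJob could look like. It assumes the kubeflow.org/v1 PyTorchJob schema and a controller-runtime client; updateGPURequest is an illustrative helper written for this ticket, not an existing Kueue or training-operator API.

```go
package elastic

import (
	"context"
	"fmt"

	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

var pytorchJobGVK = schema.GroupVersionKind{
	Group:   "kubeflow.org",
	Version: "v1",
	Kind:    "PyTorchJob",
}

// updateGPURequest rewrites nvidia.com/gpu in the requests and limits of the
// first Worker container. Per the RFE, such an update is not possible today
// on a Kueue-managed job; the proposal would allow it while the job is
// suspended.
func updateGPURequest(ctx context.Context, c client.Client, key client.ObjectKey, gpus string) error {
	job := &unstructured.Unstructured{}
	job.SetGroupVersionKind(pytorchJobGVK)
	if err := c.Get(ctx, key, job); err != nil {
		return err
	}

	path := []string{"spec", "pytorchReplicaSpecs", "Worker", "template", "spec", "containers"}
	containers, found, err := unstructured.NestedSlice(job.Object, path...)
	if err != nil {
		return err
	}
	if !found || len(containers) == 0 {
		return fmt.Errorf("no Worker containers found on %s", key)
	}

	// NestedSlice returns a deep copy, so mutate it and write it back.
	container := containers[0].(map[string]interface{})
	for _, section := range []string{"requests", "limits"} {
		if err := unstructured.SetNestedField(container, gpus, "resources", section, "nvidia.com/gpu"); err != nil {
			return err
		}
	}
	containers[0] = container
	if err := unstructured.SetNestedSlice(job.Object, containers, path...); err != nil {
		return err
	}
	return c.Update(ctx, job)
}
```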
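And a rough sketch of the two proposed extension points themselves, following the conventions of Kueue's existing jobframework interfaces (e.g. JobWithSkip). The interface and plugin names, the replica-type map, and the marker annotation are illustrative assumptions drawn from the RFE text, not an existing Kueue API.

```go
package elastic

import (
	kftraining "github.com/kubeflow/training-operator/pkg/apis/kubeflow.org/v1"
	corev1 "k8s.io/api/core/v1"
	"k8s.io/utils/ptr"
	"sigs.k8s.io/kueue/pkg/controller/jobframework"
)

// Part one: analogous to the existing JobWithSkip interface, this lets the
// jobframework reconciler leave a job alone while its resource limits and
// requests are being changed.
type JobWithSkipWhileUpdating interface {
	jobframework.GenericJob

	// SkipWhileUpdating returns true while the job's resource limits and
	// requests are being modified.
	SkipWhileUpdating() bool
}

// Part two: a plugin invoked to actually patch the underlying job object
// with the updated resources, keyed by replica type (Master, Worker, ...).
type ResourceUpdatePlugin interface {
	UpdateResources(job jobframework.GenericJob, resources map[string]corev1.ResourceRequirements) error
}

// PyTorchJob demonstrator: one way to satisfy SkipWhileUpdating is to
// require the job to be suspended and to carry a (hypothetical) marker
// annotation set by the component performing the update.
type PyTorchJob kftraining.PyTorchJob

func (j *PyTorchJob) SkipWhileUpdating() bool {
	return ptr.Deref(j.Spec.RunPolicy.Suspend, false) &&
		j.Annotations["kueue.x-k8s.io/resources-updating"] == "true"
}
```

Keeping the patching logic behind a plugin interface keeps the jobframework itself job-agnostic, which matters for extending the same mechanism from the PyTorchJob demonstrator to TrainJob later.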
<your text here>
Goals (aka. expected user outcomes)
The observable functionality that the user now has as a result of receiving this feature. Include the anticipated primary user type/persona and which existing features, if any, will be expanded. Complete during New status.
<your text here>
Requirements (aka. Acceptance Criteria):
A list of specific needs or objectives that a feature must deliver in order to be considered complete. Be sure to include nonfunctional requirements such as security, reliability, performance, maintainability, scalability, usability, etc. Initial completion during Refinement status.
This Feature was generated in OCPSTRAT via acceptance of RFE-7875. Ensure the stated Acceptance Criteria below will fulfill the needs specified in the RFE.
<enter general Feature acceptance here>
Anyone reviewing this Feature needs to know which deployment configurations the Feature will apply to (or not) once it's been completed. Describe specific needs (or indicate N/A) for each of the following deployment scenarios. For specific configurations that are out of scope for a given release, also provide the OCPSTRAT for the configuration to be supported in the future.
| Deployment considerations | List applicable specific needs (N/A = not applicable) |
| --- | --- |
| Self-managed, managed, or both | |
| Classic (standalone cluster) | |
| Hosted control planes | |
| Multi node, Compact (three node), or Single node (SNO), or all | |
| Connected / Restricted Network | |
| Architectures, e.g. x86_64, ARM (aarch64), IBM Power (ppc64le), and IBM Z (s390x) | |
| Operator compatibility | |
| Backport needed (list applicable versions) | |
| UI need (e.g. OpenShift Console, dynamic plugin, OCM) | |
| Other (please specify) | |
Use Cases (Optional):
Include use case diagrams, main success scenarios, alternative flow scenarios. Initial completion during Refinement status.
<your text here>
Questions to Answer (Optional):
Include a list of refinement / architectural questions that may need to be answered before coding can begin. Initial completion during Refinement status.
<your text here>
Out of Scope
High-level list of items that are out of scope. Initial completion during Refinement status.
<your text here>
Background
Provide any additional context that is needed to frame the feature. Initial completion during Refinement status.
<your text here>
Customer Considerations
Provide any additional customer-specific considerations that must be made when designing and delivering the Feature. Initial completion during Refinement status.
<your text here>
Documentation Considerations
Provide information that needs to be considered and planned so that documentation will meet customer needs. If the feature extends existing functionality, provide a link to its current documentation. Initial completion during Refinement status.
<your text here>
Interoperability Considerations
Which other projects, including ROSA/OSD/ARO, and versions in our portfolio does this feature impact? What interoperability test scenarios should be factored by the layered products? Initial completion during Refinement status.
<your text here>
relates to: RFE-7875 OCP Kueue: Support for changing Job resource limits and requests (Approved)