Type: Feature
Resolution: Unresolved
Priority: Normal
Work Category: BU Product Work
Parent: OCPSTRAT-1692 AI Workloads for OpenShift
Progress: 100% To Do, 0% In Progress, 0% Done

Feature Overview (aka. Goal Summary)
The OpenShift Custom Metric Autoscaler (CMA) scaler for GPU workloads is designed to provide intelligent autoscaling for GPU-driven applications such as AI/ML and LLM inference. It uses GPU-specific metrics to drive scaling decisions, allowing users to meet performance targets while minimizing the cost of idle GPU resources.
Key Metrics for GPU-Based Autoscaling
The CMA scaler offers metrics that provide deeper insight into GPU workloads, helping to optimize resource allocation based on actual GPU demand (a hedged configuration sketch showing how these metrics could drive scaling follows the list):
- Batch Size
  - Metric: tgi_batch_current_size
  - Description: Tracks the number of requests in the batch the GPU is currently processing.
  - Use Case: Effective for latency-sensitive applications; scaling in step with active GPU load keeps response times low.
  - Benefits: Correlates directly with real-time processing demand, enabling targeted scaling that reduces response times for end users.
- Queue Size
  - Metric: tgi_queue_size
  - Description: Measures the number of requests waiting to be processed on the GPU.
  - Use Case: Ideal for high-throughput applications, helping to manage large traffic volumes by scaling up as queue size increases.
  - Benefits: Triggers scaling when the queue grows, ensuring capacity for incoming requests while maintaining steady throughput.
- GPU Utilization (optional)
  - Metric: GPU duty cycle or utilization.
  - Description: Reflects the proportion of time the GPU is actively processing.
  - Limitations: Shows how busy the GPU is, but not the intensity or nature of the workload, so it is less effective as a standalone scaling metric.
  - Recommendation: Use as a secondary metric for additional GPU utilization insight, not as the primary trigger, since scaling on utilization alone risks overprovisioning.
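
As an illustration (not a committed design), the sketch below shows how these metrics could be wired up today with the CMA, which is based on KEDA, using a ScaledObject with Prometheus triggers. The deployment name, namespace, Thanos Querier address, and thresholds are placeholder assumptions, and the TriggerAuthentication needed to query OpenShift monitoring is omitted for brevity.

```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: tgi-gpu-scaler              # hypothetical name
  namespace: llm-serving            # hypothetical namespace
spec:
  scaleTargetRef:
    name: tgi-server                # hypothetical TGI inference Deployment
  minReplicaCount: 1
  maxReplicaCount: 8
  triggers:
    # Queue Size: scale out when queued requests per replica exceed a target depth.
    - type: prometheus
      metadata:
        serverAddress: https://thanos-querier.openshift-monitoring.svc.cluster.local:9092   # placeholder address
        query: avg(tgi_queue_size{job="tgi-server"})
        threshold: "10"             # placeholder queue-depth target
    # Batch Size: scale out when the active batch size indicates sustained GPU load.
    - type: prometheus
      metadata:
        serverAddress: https://thanos-querier.openshift-monitoring.svc.cluster.local:9092   # placeholder address
        query: avg(tgi_batch_current_size{job="tgi-server"})
        threshold: "32"             # placeholder batch-size target
```

When a ScaledObject has multiple triggers, KEDA evaluates each one independently and the underlying HPA scales to the highest replica count any trigger demands, so the queue and batch signals can be combined without one masking the other.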
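
For GPU utilization as a secondary signal, one option (assuming the NVIDIA DCGM exporter is running, e.g. via the NVIDIA GPU Operator, and its DCGM_FI_DEV_GPU_UTIL metric is scraped into cluster monitoring) is an extra trigger appended to the triggers list of the sketch above, with a deliberately high threshold so it backstops the workload-specific metrics rather than becoming the primary driver:

```yaml
    # Optional secondary trigger (assumption: NVIDIA DCGM exporter metrics are
    # available); the 80% threshold is illustrative, chosen high so this signal
    # only kicks in when the workload-specific triggers underestimate demand.
    - type: prometheus
      metadata:
        serverAddress: https://thanos-querier.openshift-monitoring.svc.cluster.local:9092   # placeholder address
        query: avg(DCGM_FI_DEV_GPU_UTIL)
        threshold: "80"
```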