Type: Feature
Resolution: Unresolved
Priority: Normal
Work Category: BU Product Work
Parent: OCPSTRAT-1692 AI Workloads for OpenShift
Progress: 100% To Do, 0% In Progress, 0% Done

Feature Overview (aka. Goal Summary)
The OpenShift Custom Metric Autoscaler (CMA) scaler for GPU workloads is designed to provide intelligent autoscaling for GPU-driven applications such as AI/ML and LLM inference. It uses GPU-specific metrics to drive scaling decisions, allowing users to meet performance targets while minimizing the cost of idle GPU resources.
Key Metrics for GPU-Based Autoscaling
The CMA scaler offers metrics that provide deeper insight into GPU workloads, helping to optimize resource allocation based on actual GPU demand (a hedged configuration sketch showing how these metrics could drive scaling follows the list):
- Batch Size
  - Metric: tgi_batch_current_size
  - Description: Tracks the number of requests in the batch the GPU is currently processing.
  - Use Case: Effective for latency-sensitive applications; scaling in step with active GPU load keeps response times low.
  - Benefits: Correlates directly with real-time processing demand, enabling targeted scaling that reduces response times for end users.
- Queue Size
  - Metric: tgi_queue_size
  - Description: Measures the number of requests waiting to be processed on the GPU.
  - Use Case: Ideal for high-throughput applications, helping to manage large traffic volumes by scaling up as queue size increases.
  - Benefits: Triggers scaling when the queue grows, ensuring capacity for incoming requests while maintaining steady throughput.
- GPU Utilization (optional)
  - Metric: GPU duty cycle or utilization.
  - Description: Reflects the proportion of time the GPU is actively processing.
  - Limitations: Shows how busy the GPU is, but not the intensity or nature of the workload, so it is less effective as a standalone scaling metric.
  - Recommendation: Use as a secondary metric for additional GPU utilization insight, not as the primary trigger, since scaling on utilization alone risks overprovisioning.
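
As an illustration (not a committed design), the sketch below shows how these metrics could be wired up today with the CMA, which is based on KEDA, using a ScaledObject with Prometheus triggers. The deployment name, namespace, Thanos Querier address, and thresholds are placeholder assumptions, and the TriggerAuthentication needed to query OpenShift monitoring is omitted for brevity.

```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: tgi-gpu-scaler              # hypothetical name
  namespace: llm-serving            # hypothetical namespace
spec:
  scaleTargetRef:
    name: tgi-server                # hypothetical TGI inference Deployment
  minReplicaCount: 1
  maxReplicaCount: 8
  triggers:
    # Queue Size: scale out when queued requests per replica exceed a target depth.
    - type: prometheus
      metadata:
        serverAddress: https://thanos-querier.openshift-monitoring.svc.cluster.local:9092   # placeholder address
        query: avg(tgi_queue_size{job="tgi-server"})
        threshold: "10"             # placeholder queue-depth target
    # Batch Size: scale out when the active batch size indicates sustained GPU load.
    - type: prometheus
      metadata:
        serverAddress: https://thanos-querier.openshift-monitoring.svc.cluster.local:9092   # placeholder address
        query: avg(tgi_batch_current_size{job="tgi-server"})
        threshold: "32"             # placeholder batch-size target
```

When a ScaledObject has multiple triggers, KEDA evaluates each one independently and the underlying HPA scales to the highest replica count any trigger demands, so the queue and batch signals can be combined without one masking the other.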
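
For GPU utilization as a secondary signal, one option (assuming the NVIDIA DCGM exporter is running, e.g. via the NVIDIA GPU Operator, and its DCGM_FI_DEV_GPU_UTIL metric is scraped into cluster monitoring) is an extra trigger appended to the triggers list of the sketch above, with a deliberately high threshold so it backstops the workload-specific metrics rather than becoming the primary driver:

```yaml
    # Optional secondary trigger (assumption: NVIDIA DCGM exporter metrics are
    # available); the 80% threshold is illustrative, chosen high so this signal
    # only kicks in when the workload-specific triggers underestimate demand.
    - type: prometheus
      metadata:
        serverAddress: https://thanos-querier.openshift-monitoring.svc.cluster.local:9092   # placeholder address
        query: avg(DCGM_FI_DEV_GPU_UTIL)
        threshold: "80"
```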