
OCPSTRAT-1778: OpenShift CMA Scaler for GPU (Horizontal Pod Autoscaling)


    • OCPSTRAT-1692 AI Workloads for OpenShift

      Feature Overview (aka. Goal Summary)  

      The OpenShift Custom Metric Autoscaler (CMA) Scaler for GPU workloads is designed to provide intelligent autoscaling for GPU-driven applications, such as AI/ML and LLM inference tasks. The CMA Scaler utilizes GPU-specific metrics to manage scaling more efficiently, allowing users to meet performance targets while minimizing the cost of unused GPU resources.
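
      To make the configuration model concrete, the sketch below assembles a KEDA-style ScaledObject manifest with a Prometheus trigger, which is the mechanism the Custom Metric Autoscaler uses to consume external metrics. The deployment name, namespace, Prometheus address, query, and threshold are illustrative assumptions rather than part of this feature's definition, and the trigger authentication that OpenShift monitoring normally requires is omitted for brevity.

      # Minimal sketch of a Custom Metric Autoscaler ScaledObject scaling a
      # hypothetical TGI deployment on one of the GPU metrics listed below.
      # All names, addresses, and thresholds are illustrative assumptions.
      import yaml  # requires the PyYAML package

      scaled_object = {
          "apiVersion": "keda.sh/v1alpha1",
          "kind": "ScaledObject",
          "metadata": {"name": "tgi-gpu-scaler", "namespace": "llm-inference"},
          "spec": {
              "scaleTargetRef": {"name": "tgi-server"},  # assumed Deployment name
              "minReplicaCount": 1,
              "maxReplicaCount": 8,
              "triggers": [
                  {
                      "type": "prometheus",
                      "metadata": {
                          # assumed in-cluster Thanos querier endpoint
                          "serverAddress": "https://thanos-querier.openshift-monitoring.svc:9092",
                          "query": "avg(tgi_queue_size)",  # one of the metrics described below
                          "threshold": "30",  # illustrative target per replica
                      },
                  }
              ],
          },
      }

      print(yaml.safe_dump(scaled_object, sort_keys=False))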

      Key Metrics for GPU-Based Autoscaling
      The CMA Scaler offers advanced metrics that provide deeper insights into GPU workloads, helping to optimize resource allocation based on actual GPU demand:

      1. Batch Size
        • Metric: tgi_batch_current_size
        • Description: Tracks the number of requests processed in each GPU batch.
        • Use Case: Effective for latency-sensitive applications, ensuring low latency by scaling in response to active GPU load.
        • Benefits: Directly correlates to real-time processing needs, allowing for targeted scaling to reduce response times for end-users. An illustrative trigger for this metric is sketched after the list.
      2. Queue Size
        • Metric: tgi_queue_size
        • Description: Measures the number of requests waiting to be processed on the GPU.
        • Use Case: Ideal for high-throughput applications, helping to manage large traffic volumes by scaling up as queue size increases.
        • Benefits: Triggers scaling when the queue grows, ensuring capacity for incoming requests while maintaining steady throughput. The replica arithmetic behind this metric is worked through in a sketch after the list.
      3. GPU Utilization (optional)
        • Metric: GPU duty cycle or utilization.
        • Description: Reflects the active processing time of the GPU.
        • Limitations: While it shows the GPU’s activity level, it lacks specificity in workload intensity, making it less effective as a standalone scaling metric.
        • Recommendation: Use as a secondary metric to gain additional GPU utilization insight, but not as the primary trigger, due to the risk of overprovisioning. A sketch of adding it as a secondary trigger follows the list.
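
      For the batch-size metric, a latency-oriented trigger could take roughly the shape below; it would slot into the triggers list of a ScaledObject like the one sketched under the feature overview. The query, label selector, and threshold are assumptions, not prescribed values.

      # Illustrative Prometheus trigger keyed on current batch size (latency-oriented).
      # Query and threshold are assumptions; tgi_batch_current_size is the TGI metric named above.
      batch_size_trigger = {
          "type": "prometheus",
          "metadata": {
              "serverAddress": "https://thanos-querier.openshift-monitoring.svc:9092",  # assumed endpoint
              "query": 'avg(tgi_batch_current_size{namespace="llm-inference"})',
              "threshold": "4",  # target average in-flight batch size per replica (illustrative)
          },
      }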
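
      To make the queue-size behavior concrete, the snippet below walks through the proportional calculation the underlying Horizontal Pod Autoscaler applies to an averaged metric: replicas are adjusted so that queue depth per replica trends back toward the target. The numbers are invented for illustration.

      import math

      # HPA-style proportional scaling on an averaged queue metric (illustrative numbers).
      target_queue_per_replica = 30   # desired tgi_queue_size per replica (assumed threshold)
      total_queue_size = 120          # current cluster-wide queue depth (assumed observation)

      desired_replicas = math.ceil(total_queue_size / target_queue_per_replica)
      print(desired_replicas)         # -> 4: scale out until roughly 30 queued requests per replica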
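
      Because raw utilization says little about workload intensity, it fits better as an additional trigger alongside one of the primary metrics above; when several triggers are defined, the autoscaler follows whichever one demands the most replicas. The sketch below assumes the NVIDIA DCGM exporter is scraped by cluster monitoring and exposes DCGM_FI_DEV_GPU_UTIL; the query and threshold are illustrative, and a real query would be scoped to the target workload's pods.

      # Illustrative secondary trigger on GPU utilization from DCGM exporter metrics.
      # Assumes DCGM_FI_DEV_GPU_UTIL is available in Prometheus; values are placeholders.
      gpu_util_trigger = {
          "type": "prometheus",
          "metadata": {
              "serverAddress": "https://thanos-querier.openshift-monitoring.svc:9092",  # assumed endpoint
              "query": "avg(DCGM_FI_DEV_GPU_UTIL)",
              "threshold": "80",  # scale out when average GPU utilization exceeds ~80% (illustrative)
          },
      }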

       
