Observability and Data Analysis Program · OBSDA-1

Allow autoscaling via custom metrics



    Description

      Problem Alignment

      The Problem

      Not every workload uses CPU and Memory to determine if they need to be scaled up or down. For example, a webapp exposing an HTTP API needs to scale based on incoming traffic (number of HTTP requests).

      Autoscaling is a key feature of Kubernetes. There are different options you can use to scale either your workload (the number of replicas or the resources attached to it) or the number of nodes in your cluster. With this feature, we will focus on scaling workloads by increasing or decreasing the number of replicas, i.e. the Horizontal Pod Autoscaler.

      The Horizontal Pod Autoscaler is implemented as a Kubernetes API resource and a controller. The controller manager obtains metrics from either the resource metrics API (CPU and Memory) or the custom metrics API (for all other metrics). Currently, OpenShift only exposes an implementation of the resource metrics API, which is available out of the box with a standard OpenShift installation. It does not provide an implementation of the custom metrics API, so customers can’t use any other metric to operate reliable, high-SLA applications with minimal to no downtime during, for example, peak times.
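The out-of-the-box resource metrics path described above can be illustrated with a standard autoscaling/v2 HorizontalPodAutoscaler targeting CPU utilization (the deployment name and thresholds below are illustrative, not from this ticket):

```yaml
# HPA using the resource metrics API (CPU), which works on a
# standard OpenShift installation without any extra component.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: webapp            # hypothetical workload
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: webapp
  minReplicas: 1
  maxReplicas: 5
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70   # scale out above ~70% average CPU
```

Anything beyond `type: Resource` (i.e. `type: Pods`, `Object`, or `External`) requires a custom metrics API implementation, which is the gap this feature addresses.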

      High-Level Approach

      Provide an installable component that implements the custom metrics API which exposes ANY metric collected by our OpenShift Monitoring system.

      Goal & Success

      • Increase the value of running workloads on OpenShift by providing tools that allow workloads to scale automatically based on agreed parameters, keeping them available at all times.
      • The number of HPA objects that use a custom metric exposed by our Monitoring stack should grow.

      Solution Alignment

      Key Capabilities

      • A user must be able to configure a new HorizontalPodAutoscaler object using a metric other than CPU and Memory.
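This capability can be sketched as an autoscaling/v2 HorizontalPodAutoscaler that uses a Pods-type metric served by the custom metrics API (the metric name `http_requests_per_second` and the workload name are assumptions for illustration):

```yaml
# HPA scaling on a custom per-pod metric instead of CPU/Memory.
# Requires a custom metrics API implementation to be installed.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: buy-api           # hypothetical workload
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: buy-api
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Pods
      pods:
        metric:
          name: http_requests_per_second   # assumed custom metric name
        target:
          type: AverageValue
          averageValue: "20"   # target ~20 requests/second per pod
```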

      Key Flows

      Imagine a company that runs a Shop web application on OpenShift. During Christmas, they usually see a huge spike in traffic to their “Buy” API and have been struggling to fulfill all requests, even dropping connections, which directly translates into lost revenue. That company would like to use the Horizontal Pod Autoscaler to scale up/down based on throughput. Their threshold at the moment is around 20 requests per second.

      CLI Flow:

      1. OpenShift Monitoring already collects HTTP-specific metrics via HAProxy out of the box.
      2. Specify a query that exposes “throughput” as a single metric to HPA.
      3. Configure an HPA object that uses the custom metric from step 2.
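One possible shape for step 2, assuming the component is based on the Prometheus Adapter (mentioned in the open questions below), is a rule that turns an HAProxy counter into a per-second rate. The series name and label set here are assumptions for illustration, not a confirmed design:

```yaml
# Prometheus Adapter rule sketch: expose "throughput" as a rate metric.
rules:
  # Discover HAProxy counters attributable to a namespace and pod.
  - seriesQuery: 'haproxy_backend_http_responses_total{namespace!="",pod!=""}'
    resources:
      overrides:
        namespace: {resource: "namespace"}
        pod: {resource: "pod"}
    # Rename *_total counters to *_per_second for the custom metrics API.
    name:
      matches: "^(.*)_total$"
      as: "${1}_per_second"
    # The PromQL template that computes the actual value served to HPA.
    metricsQuery: 'sum(rate(<<.Series>>{<<.LabelMatchers>>}[2m])) by (<<.GroupBy>>)'
```

The `metricsQuery` field is the Adapter feature referenced in the open questions: it lets a query, rather than a raw stored metric, back the custom metric that HPA consumes.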

      UI Flow:

      • In the developer perspective, go to the “Buy” API deployment.
      • Click on the “Configure autoscaling” button.
      • In the new popup, users set up everything necessary for HPA, such as the metrics they’d like to use (those available to the user) and a threshold.

      Open Questions & Key Decisions (optional)

      • Do we foresee use cases where we do not have a Prometheus store locally and need to pull metrics from a remote storage (e.g. HyperShift)?
      • How do we want to provide the component that exposes the custom metrics API? Should it be 1) installed with OpenShift Monitoring, or 2) installed and configured by an admin?
      • Is there a limit to how many metrics that component can expose before it significantly impacts the Prometheus cluster? If so, how do we want to expose configuration to restrict the set of exposed metrics?
      • During my competitor analysis (see the respective section in this doc) I found that no one really provides the ability for users to define a “custom metric” based on a query instead of a plain metric available in the metrics store. Is that on purpose? For example, in the key flow we highlighted “throughput” as the scale parameter; as described there, “throughput” is defined as a PromQL query over metrics available via HAProxy. I also saw that the Prometheus Adapter provides a “metricsQuery” field where you can define a query and expose its value as another metric. Is that what we want, and if so, how do we expose something like that across tenants?
      • Do metrics have to be of a specific type (e.g. Gauge) or does HPA not really care about that?
      • Do we expect to support scaling workloads outside an OpenShift cluster?
      • How do we let users configure the custom metrics they need, particularly since you need to select the metrics for your specific pod?
      • Clarify HPA in HyperShift.

            People

              rh-ee-rfloren Roger Florén
              cvogel1 Christian Heidenreich (Inactive)
              Votes: 9
              Watchers: 19
