OpenShift Monitoring / MON-1901

KEDA productization consultancy

    • Type: Task
    • Resolution: Done
    • Priority: Major
    • Sprint: Monitoring - Sprint 207, Monitoring - Sprint 208, Monitoring - Sprint 209

      After discussing replacing prometheus-adapter with KEDA to serve custom metrics, we were tasked with finding the gaps in the current solution that must be closed before autoscaling on custom metrics can become a tech preview feature.

      We identified the following gaps that need to be addressed before KEDA can integrate seamlessly with the OpenShift monitoring stack.

      1. When users define a ScaledObject in OpenShift, they need to specify the URL of the Thanos Querier that is available in OpenShift. This URL is not publicly documented and could be changed or removed in the future. They also need to pass the ScaledObject’s namespace in a namespace query parameter. To relieve users of this burden, the bundle in the downstream OLM catalog needs to deploy a mutating webhook that sets the server address on every ScaledObject deployed without one.
      2. The Thanos Querier is protected by kube-rbac-proxy, which authenticates callers based on whether they have access to the namespace they are querying metrics from. For authentication to succeed, KEDA needs to pass its Kubernetes service account token when querying Thanos. For users to integrate seamlessly with the OpenShift monitoring stack without worrying about this implementation detail:
        • The bundle in the downstream OLM catalog needs to create a ClusterTriggerAuthentication when deploying KEDA
        • The mutating webhook from 1. needs to include the ClusterTriggerAuthentication in ScaledObjects that reference the OpenShift monitoring Thanos Querier as their server address

      Since users can specify an arbitrary endpoint as the server address, we need to be careful not to expose the service account token to an external system. The options we have are:

      • Not allowing users to define a server address at all
      • Removing the ClusterTriggerAuthentication from all ScaledObjects whose server address is different from the one exposed by the Thanos Query in the OpenShift monitoring stack

      Monitoring tasks:

      KEDA tasks:

      Set defaults in the Prometheus scaler object if no server address is specified

      When installing KEDA in OpenShift, a mutating webhook needs to be deployed to make sure certain parameters are configured with proper defaults.
      If users create a ScaledObject resource without a serverAddress, the tenant-specific endpoint of the Thanos Querier needs to be injected together with the namespace of the ScaledObject as a query parameter. This will ensure that by default, users would auto-scale their workloads based on metrics available in the OpenShift monitoring stack. In addition, the mutating webhook will have to inject the authentication and authorization pieces required to query Thanos Querier. To prevent any leaks, these secrets should be hosted in KEDA's namespace and injected via a ClusterTriggerAuthentication resource.

      Example:

      If a scaled object has the following parameters:

       ...
       metadata:
         ...
         namespace: super-important-project
       triggers:
       - type: prometheus
         metadata:
           metricName: http_requests_total
           threshold: '100'
           query: sum(rate(http_requests_total{deployment="my-deployment"}[2m]))
      

      The mutating webhook needs to set the serverAddress of the ScaledObject to:
      https://thanos-querier.openshift-monitoring:9092?namespace=super-important-project
      It also needs to set the authenticationRef to the ClusterTriggerAuthentication resource that references the secrets, hosted in KEDA's namespace, required to query the Thanos Querier.
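
      As a rough illustration, the ScaledObject from the example might look like the following once the webhook has mutated it. This is a sketch under assumptions, not confirmed webhook output: the resource name my-scaledobject, the spec and scaleTargetRef wrapper, and the ClusterTriggerAuthentication name openshift-monitoring-thanos are hypothetical, and authModes is included on the assumption that the Prometheus scaler is configured for bearer authentication.

      ```yaml
      apiVersion: keda.sh/v1alpha1
      kind: ScaledObject
      metadata:
        name: my-scaledobject                # hypothetical name
        namespace: super-important-project
      spec:
        scaleTargetRef:
          name: my-deployment                # assumed scale target
        triggers:
        - type: prometheus
          metadata:
            # Injected by the webhook because no serverAddress was specified
            serverAddress: https://thanos-querier.openshift-monitoring:9092?namespace=super-important-project
            metricName: http_requests_total
            threshold: '100'
            query: sum(rate(http_requests_total{deployment="my-deployment"}[2m]))
            authModes: bearer                # assumption: bearer auth is enabled on the scaler
          # Injected by the webhook; the referenced name is hypothetical
          authenticationRef:
            name: openshift-monitoring-thanos
            kind: ClusterTriggerAuthentication
      ```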

      Authenticate KEDA with the Thanos Query tenancy-specific endpoint

      The Thanos Querier is protected with kube-rbac-proxy, an HTTP proxy that verifies whether the client making the request has access to the namespace it is querying metrics from.

      In order for KEDA to authenticate with the proxy, it needs to send its service account token in the HTTP Authorization header using the Bearer scheme. This needs to happen only when the serverAddress in the ScaledObject points to the Thanos Query endpoint. Otherwise, KEDA could leak its service account token to an external endpoint.
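
      One way the token could be handed to the scaler is a ClusterTriggerAuthentication that maps keys of a Secret in KEDA's namespace to the Prometheus scaler's bearerToken and ca parameters. The sketch below assumes the service account token and CA bundle are available in a Secret named keda-thanos-token under the keys token and ca.crt; both the Secret name and the resource name openshift-monitoring-thanos are hypothetical.

      ```yaml
      apiVersion: keda.sh/v1alpha1
      kind: ClusterTriggerAuthentication
      metadata:
        name: openshift-monitoring-thanos    # hypothetical name
      spec:
        secretTargetRef:
        # Token sent in the Authorization header using the Bearer scheme
        - parameter: bearerToken
          name: keda-thanos-token            # hypothetical Secret in KEDA's namespace
          key: token
        # CA bundle used to verify the Thanos Querier serving certificate (assumed key)
        - parameter: ca
          name: keda-thanos-token
          key: ca.crt
      ```

      Because the resource is cluster-scoped and the Secret stays in KEDA's namespace, tenants can reference the authentication without being able to read the token itself.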

      For authentication to succeed, KEDA will need to have GET RBAC permissions on all namespaces in which autoscaling should be supported.

      OpenShift HA requirements for KEDA’s metrics-server:

      • Two replicas
      • Hard pod anti-affinity on hostname (no two pods on the same node)
      • Use a rolling update strategy with maxUnavailable set on deployments (prefer a default value of 1)
      • Add a PodDisruptionBudget with minAvailable: 1
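
      The requirements above could translate into manifests along these lines. This is a minimal sketch: the name and app label keda-metrics-apiserver and the image reference are assumptions for illustration, and the actual downstream bundle may differ.

      ```yaml
      apiVersion: apps/v1
      kind: Deployment
      metadata:
        name: keda-metrics-apiserver         # assumed name
      spec:
        replicas: 2                          # two replicas
        strategy:
          type: RollingUpdate
          rollingUpdate:
            maxUnavailable: 1                # at most one pod down during a rollout
        selector:
          matchLabels:
            app: keda-metrics-apiserver
        template:
          metadata:
            labels:
              app: keda-metrics-apiserver
          spec:
            affinity:
              podAntiAffinity:
                # Hard anti-affinity: never schedule two replicas on the same node
                requiredDuringSchedulingIgnoredDuringExecution:
                - labelSelector:
                    matchLabels:
                      app: keda-metrics-apiserver
                  topologyKey: kubernetes.io/hostname
            containers:
            - name: keda-metrics-apiserver
              image: example.invalid/keda-metrics-apiserver:placeholder  # placeholder image
      ```

      ```yaml
      apiVersion: policy/v1
      kind: PodDisruptionBudget
      metadata:
        name: keda-metrics-apiserver         # assumed name
      spec:
        minAvailable: 1                      # keep one replica up during voluntary disruptions
        selector:
          matchLabels:
            app: keda-metrics-apiserver
      ```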

      Tasks to consider before graduating to GA

      • Cache query results to reduce the load on the Prometheus server
      • Spread queries to Prometheus over a time window to prevent bursts
      • Add a validation webhook validating the PromQL queries

              dgrisonn@redhat.com Damien Grisonnet