
Type: Task
Resolution: Done
Priority: Major
Fix Version/s: None
Affects Version/s: None
Component/s: None
Labels:
- groomed

Blocked:
False
Ready:
False
Epic Link:
GA support for "Custom Metric Autoscaler"
Docs QE Status:
NEW
QE Status:
NEW
Sprint:
Monitoring - Sprint 207, Monitoring - Sprint 208, Monitoring - Sprint 209


After discussing the replacement of prometheus-adapter with KEDA to serve custom metrics, we were tasked with finding the gaps in the current solution that must be covered before autoscaling on custom metrics can become a tech preview feature.

We identified the following gaps that need to be addressed before KEDA is seamlessly integrated with the OpenShift monitoring stack.

1. When users define a ScaledObject in OpenShift, they need to specify the URL of the Thanos Querier that is available in OpenShift. This URL is not publicly documented and could be changed or removed in the future. They also need to specify the ScaledObject's namespace in a namespace query parameter. To relieve users of this burden, the bundle in the downstream OLM catalog needs to deploy a mutating webhook that configures the server address of every ScaledObject deployed without one.
2. The Thanos Querier is protected by kube-rbac-proxy, which authenticates callers based on whether they have access to the namespace they are querying metrics from. For authentication to succeed, KEDA needs to pass its Kubernetes service account token when querying Thanos. If we want users to integrate seamlessly with the OpenShift monitoring stack without worrying about this implementation detail:
- The bundle in the downstream OLM catalog needs to create a ClusterTriggerAuthentication when deploying KEDA
- The mutating webhook from 1. needs to include the ClusterTriggerAuthentication in ScaledObjects that reference the OpenShift monitoring Thanos Querier as their server address

Since users can specify an arbitrary endpoint as the server address, we need to be careful to not expose the service account token to an external system. The options we have are:

- Not allowing users to define a server address at all
- Removing the ClusterTriggerAuthentication from all ScaledObjects whose server address is different from the one exposed by the Thanos Querier in the OpenShift monitoring stack

Monitoring tasks:

- Review KEDA’s Prometheus scaler code
- Experiment with KEDA in OpenShift to identify potential gaps

KEDA tasks:

Set defaults in the Prometheus scaler if no server address is specified

When installing KEDA in OpenShift, a mutating webhook needs to be deployed to make sure certain parameters are configured with proper defaults.
If users create a ScaledObject resource without a serverAddress, the tenant-specific endpoint of the Thanos Querier needs to be injected together with the namespace of the ScaledObject as a query parameter. This will ensure that by default, users would auto-scale their workloads based on metrics available in the OpenShift monitoring stack. In addition, the mutating webhook will have to inject the authentication and authorization pieces required to query Thanos Querier. To prevent any leaks, these secrets should be hosted in KEDA's namespace and injected via a ClusterTriggerAuthentication resource.
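As a rough sketch of the authentication piece described above, the bundle could ship a ClusterTriggerAuthentication along these lines. The resource name `keda-openshift-monitoring` and the Secret name `keda-thanos-token` are hypothetical, and the sketch assumes a service account token Secret with `token` and `ca.crt` keys already exists in KEDA's namespace.

```yaml
# Sketch only: resource and Secret names are hypothetical.
apiVersion: keda.sh/v1alpha1
kind: ClusterTriggerAuthentication
metadata:
  name: keda-openshift-monitoring
spec:
  secretTargetRef:
  # Bearer token that KEDA sends to the Thanos Querier.
  # The Secret is assumed to live in KEDA's own namespace.
  - parameter: bearerToken
    name: keda-thanos-token
    key: token
  # CA bundle used to verify the Thanos Querier serving certificate
  # (assumed to be present in the same Secret).
  - parameter: ca
    name: keda-thanos-token
    key: ca.crt
```

Because a ClusterTriggerAuthentication is cluster-scoped and resolves its Secrets from KEDA's namespace, users never need access to the token itself.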

Example:

If a scaled object has the following parameters:

 ...
 metadata:
   ...
   namespace: super-important-project
 spec:
   ...
   triggers:
   - type: prometheus
     metadata:
       metricName: http_requests_total
       threshold: '100'
       query: sum(rate(http_requests_total{deployment="my-deployment"}[2m]))

The mutating webhook needs to set the serverAddress of the trigger to:
https://thanos-querier.openshift-monitoring:9092?namespace=super-important-project and the authenticationRef to the ClusterTriggerAuthentication resource that references the secrets in KEDA's namespace required to query the Thanos Querier.
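For illustration, the ScaledObject from the example might look roughly like this after the webhook has mutated it. The object name, the `scaleTargetRef`, and the ClusterTriggerAuthentication name `keda-openshift-monitoring` are hypothetical, and the `authModes: bearer` setting is an assumption about how the Prometheus scaler would be told to use the injected token.

```yaml
# Sketch of a possible mutated object; names marked hypothetical are not from the ticket.
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: my-scaledobject                # hypothetical
  namespace: super-important-project
spec:
  scaleTargetRef:
    name: my-deployment                # hypothetical target
  triggers:
  - type: prometheus
    metadata:
      # Injected by the webhook: tenancy endpoint plus the object's namespace.
      serverAddress: https://thanos-querier.openshift-monitoring:9092?namespace=super-important-project
      metricName: http_requests_total
      threshold: '100'
      query: sum(rate(http_requests_total{deployment="my-deployment"}[2m]))
      authModes: bearer                # assumption: bearer token auth is enabled here
    # Injected by the webhook: points at the cluster-wide authentication resource.
    authenticationRef:
      name: keda-openshift-monitoring  # hypothetical
      kind: ClusterTriggerAuthentication
```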

Authenticate KEDA with the Thanos Querier tenancy-specific endpoint

The Thanos Querier is protected with kube-rbac-proxy, an HTTP proxy that verifies whether the client making the request has access to the namespace it is querying metrics from.

For KEDA to authenticate with the proxy, it needs to send its service account token in the HTTP Authorization header using the Bearer scheme. This must happen only when the serverAddress in the ScaledObject points to the Thanos Querier endpoint. Otherwise, KEDA could leak its service account token to an external endpoint.

For the proxy to accept these requests, KEDA's service account needs GET RBAC permissions in every namespace in which autoscaling should be supported.
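One way the permissions above could be granted cluster-wide is sketched below. This assumes the proxy's access check is a `get` on `pods` in the `metrics.k8s.io` API group, which is an assumption about the proxy configuration and is not stated in this ticket. The role name, the service account name `keda-operator`, and the namespace `openshift-keda` are likewise hypothetical.

```yaml
# Sketch only: the checked resource and all names are assumptions.
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: keda-thanos-metrics-reader     # hypothetical
rules:
- apiGroups:
  - metrics.k8s.io                     # assumed resource checked by kube-rbac-proxy
  resources:
  - pods
  verbs:
  - get
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: keda-thanos-metrics-reader     # hypothetical
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: keda-thanos-metrics-reader
subjects:
- kind: ServiceAccount
  name: keda-operator                  # hypothetical service account
  namespace: openshift-keda            # hypothetical KEDA namespace
```

A ClusterRoleBinding covers all namespaces at once. Per-namespace RoleBindings would be the narrower alternative if autoscaling should only be supported in selected namespaces.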

OpenShift HA requirements for KEDA’s metrics-server:

- Two replicas
- Hard pod anti-affinity on hostname (no two pods on the same node)
- Use the maxUnavailable rollout strategy on deployments (default to a value of 1)
- Add a PodDisruptionBudget with minAvailable: 1
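Taken together, the HA requirements above could translate into manifests along these lines. This is a sketch, not the actual KEDA bundle: the deployment name, labels, namespace, and image are placeholders.

```yaml
# Sketch only: names, labels, namespace, and image are placeholders.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: keda-metrics-apiserver
  namespace: openshift-keda
spec:
  replicas: 2                          # two replicas
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 1                # at most one pod down during a rollout
  selector:
    matchLabels:
      app: keda-metrics-apiserver
  template:
    metadata:
      labels:
        app: keda-metrics-apiserver
    spec:
      affinity:
        podAntiAffinity:
          # Hard anti-affinity: never schedule two replicas on the same node.
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchLabels:
                app: keda-metrics-apiserver
            topologyKey: kubernetes.io/hostname
      containers:
      - name: keda-metrics-apiserver
        image: registry.example.com/keda-metrics-apiserver:placeholder
---
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: keda-metrics-apiserver
  namespace: openshift-keda
spec:
  minAvailable: 1                      # keep one replica up during voluntary disruptions
  selector:
    matchLabels:
      app: keda-metrics-apiserver
```

Note that hard anti-affinity with two replicas requires at least two schedulable nodes. On a single-node cluster the second pod would stay pending.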

Tasks to consider before graduating to GA

- Cache query results to reduce the load on the Prometheus server
- Spread queries to Prometheus over a time window to prevent bursts
- Add a validation webhook validating the PromQL queries
