  1. OpenShift Bugs
  2. OCPBUGS-32241

PromQL queries of the "API Performance" dashboard can overload Thanos queriers


Details

    • Bug
    • Resolution: Unresolved
    • Undefined
    • None
    • 4.10
    • kube-apiserver
    • No
    • False
    • API server dashboards use recording rules so that performance data is cached and Thanos is not overloaded by queries spanning multiple days
    • Bug Fix
    • In Progress

    Description

      This is a clone of issue OCPBUGS-25922. The following is the description of the original issue:

      This is a clone of issue OCPBUGS-3986. The following is the description of the original issue:

      Description of problem:

      A customer has reported that the Thanos querier pods are OOM-killed when loading the API performance dashboard with a large time range (e.g. >= 1 week).

      Version-Release number of selected component (if applicable):

      4.10

      How reproducible:

      Always for the customer

      Steps to Reproduce:

      1. Open the "API performance" dashboard in the admin console.
      2. Select a time range of 2 weeks.

      Actual results:

      The dashboard fails to refresh and the thanos-query pods are killed.

      Expected results:

      The dashboard loads without error.

      Additional info:

      The issue arises for the customer because they run very large clusters (hundreds of nodes), which generate a large volume of metrics.
      In practice, the queries executed by the dashboard are costly because they access many series (probably more than tens of thousands). To be more efficient, the "upstream" dashboard from kubernetes-monitoring/kubernetes-mixin uses recording rules [1] instead of raw queries. While this slightly reduces accuracy (one can only distinguish between read and write API requests), it is the only solution that avoids overloading the Thanos query endpoint.
      
      [1] https://github.com/kubernetes-monitoring/kubernetes-mixin/blob/05a58f765eda05902d4f7dd22098a2b870f7ca1e/dashboards/apiserver.libsonnet#L50-L75
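
To make the cost argument concrete, here is a rough back-of-envelope sketch of how many samples a range query has to read with and without recording rules. All figures are hypothetical assumptions chosen for illustration (50,000 raw series, a 30-second sample interval, and two recorded series for read and write requests); they are not measurements from the customer's cluster, and the model ignores per-step window evaluation, downsampling and caching.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RangeQuery:
    """Very rough cost model for a dashboard range query (illustrative only)."""
    series: int            # number of time series the query selects
    range_seconds: int     # time range selected in the dashboard
    sample_interval: int   # seconds between stored samples (assumed)

    def samples_loaded(self) -> int:
        # Approximate number of raw samples the querier must read.
        return self.series * (self.range_seconds // self.sample_interval)


def reduction_factor(raw: RangeQuery, recorded: RangeQuery) -> float:
    """How many times fewer samples the recorded variant reads."""
    return raw.samples_loaded() / recorded.samples_loaded()


TWO_WEEKS = 14 * 24 * 3600  # the time range from the reproduction steps

# Hypothetical: raw API server series on a cluster with hundreds of nodes.
raw = RangeQuery(series=50_000, range_seconds=TWO_WEEKS, sample_interval=30)
# Hypothetical: one pre-aggregated series each for read and write requests.
recorded = RangeQuery(series=2, range_seconds=TWO_WEEKS, sample_interval=30)

if __name__ == "__main__":
    print(f"raw query:      {raw.samples_loaded():,} samples")
    print(f"recorded query: {recorded.samples_loaded():,} samples")
    print(f"reduction:      {reduction_factor(raw, recorded):,.0f}x")
```

Under these assumptions the raw query reads about two billion samples while the recorded variant reads under a hundred thousand, which is consistent with the querier running out of memory only on large clusters and long time ranges.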

       


            People

              vrutkovs@redhat.com Vadim Rutkovsky
              openshift-crt-jira-prow OpenShift Prow Bot
              Ke Wang Ke Wang
