  1. OpenShift Bugs
  2. OCPBUGS-3986

PromQL queries of the "API Performance" dashboard can overload Thanos queriers


    • Type: Bug
    • Resolution: Done-Errata
    • Priority: Undefined
    • Fix Version: 4.14.0
    • Affects Version: 4.10
    • Component: kube-apiserver
    • Release Note Text:

      * Previously, observability dashboards used expensive queries to show data which caused frequent timeouts on clusters with a large number of nodes. With this release, observability dashboards use recording rules that are precalculated to ensure reliability on clusters with a large number of nodes. (link:https://issues.redhat.com/browse/OCPBUGS-3986[*OCPBUGS-3986*])
    • Release Note Type: Bug Fix
    • Status: Done

      Description of problem:

      A customer has reported that the Thanos querier pods are OOM-killed when loading the "API Performance" dashboard with large time ranges (e.g. >= 1 week).

      Version-Release number of selected component (if applicable):

      4.10

      How reproducible:

      Always for the customer

      Steps to Reproduce:

      1. Open the "API Performance" dashboard in the admin console.
      2. Select a time range of 2 weeks.

      Actual results:

      The dashboard fails to refresh and the thanos-query pods are OOM-killed.

      Expected results:

      The dashboard loads without error.

      Additional info:

      The issue arises for the customer because they have very large clusters (hundreds of nodes) which generate a very large number of metrics.
      In practice, the queries executed by the dashboard are costly because they touch a very large number of series (probably tens of thousands or more). To make the dashboard more efficient, the "upstream" dashboard from kubernetes-monitoring/kubernetes-mixin uses recording rules [1] instead of raw queries. While this slightly decreases accuracy (one can only distinguish between read and write API requests), it is the only practical way to avoid overloading the Thanos query endpoint.
      
      [1] https://github.com/kubernetes-monitoring/kubernetes-mixin/blob/05a58f765eda05902d4f7dd22098a2b870f7ca1e/dashboards/apiserver.libsonnet#L50-L75
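      For illustration, a recording rule in the spirit of the upstream mixin might look like the following (the rule name and label matchers here are illustrative, adapted from the kubernetes-mixin; the exact rules shipped in OpenShift may differ):

      ```yaml
      # Illustrative Prometheus recording rule, modeled on the
      # kubernetes-mixin apiserver rules; names and matchers are assumptions.
      groups:
        - name: apiserver-requests.rules
          rules:
            # Pre-aggregate the read-request rate so the dashboard queries
            # one small, precalculated series set instead of evaluating a
            # rate() over thousands of raw apiserver_request_total series.
            - record: code:apiserver_request_total:rate5m
              expr: |
                sum by (code) (
                  rate(apiserver_request_total{job="apiserver",verb=~"LIST|GET"}[5m])
                )
      ```

      A dashboard panel can then query the precalculated series (e.g. `code:apiserver_request_total:rate5m`) directly, which the Thanos queriers can serve cheaply even over multi-week ranges.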


              Vadim Rutkovsky (vrutkovs@redhat.com)
              Simon Pasquier (spasquie@redhat.com)
              Deepak Punia (Inactive)
              Votes: 2
              Watchers: 14