OpenShift Monitoring / MON-2851

Insight into scalability bottlenecks


Details

    • Type: Epic
    • Resolution: Obsolete
    • Priority: Normal
    • scalability insight
    • MON-3159 Technical Debt
    • Status: To Do

    Description

      This is the first effort to solve the linked feature request.

      Problem

      We have several footguns in our stack: users can create resources that over-tax it and cause errors or unavailability.

      We have seen:

      • Queries that are too expensive, i.e. that take too long to execute.
      • Alert rules that take very long to evaluate.
      • Alert rules that are repeated for many namespaces (see the sketch after this list).
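
      As an illustration of the last point, a user-defined PrometheusRule that gets stamped into every application namespace multiplies rule-evaluation load on the user workload monitoring stack. The object below is only a minimal sketch of that pattern; the names, the expression, and the threshold are illustrative and not taken from a real incident.

        apiVersion: monitoring.coreos.com/v1
        kind: PrometheusRule
        metadata:
          name: team-alerts        # illustrative name
          namespace: team-a        # the same object is copied into team-b, team-c, ...
        spec:
          groups:
            - name: team-alerts
              rules:
                - alert: HighErrorRate
                  # Cheap for one namespace, but the cost adds up when the identical
                  # rule is evaluated for dozens of namespaces.
                  expr: sum(rate(http_requests_total{code=~"5.."}[5m])) by (pod) > 0
                  for: 15m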

      We have guards in place for some of these. A long-running query will time out (though it will still cause significant load); a sketch of that guard follows below. Other issues cause strange, seemingly unrelated symptoms. Sometimes we struggle first to pinpoint the issue and then to communicate to users why they can't do something, even though they (seemingly) followed supported procedures.
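
      For context, the timeout guard corresponds to the --query.timeout flag that Prometheus and Thanos Querier expose. The snippet below is a minimal sketch of how it shows up on a querier container, assuming the upstream defaults; the actual flags and values in our deployments may differ.

        # Container args of a querier, sketched with upstream default values.
        containers:
          - name: thanos-query
            args:
              - query
              - --query.timeout=2m         # long-running queries are cancelled after 2 minutes
              - --query.max-concurrent=20  # bounds how many queries run at once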

      The first step to improving this is to gain easy insight into these bottlenecks and to be able to alert on them.

      https://github.com/thanos-io/thanos/pull/5741 is an example: it adds a new metric to thanos-querier that should be useful here.

      We need to identify and fill gaps in our metric set and then leverage these metrics to notify users before things break.
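
      One way to leverage such metrics is an alerting rule that fires when a rule group spends most of its evaluation interval evaluating. The rule below is a minimal sketch built on the existing prometheus_rule_group_last_duration_seconds and prometheus_rule_group_interval_seconds metrics; the 0.8 ratio, the alert name, and the severity are illustrative placeholders, not a proposal for concrete thresholds.

        apiVersion: monitoring.coreos.com/v1
        kind: PrometheusRule
        metadata:
          name: scalability-insight        # illustrative name
          namespace: openshift-monitoring
        spec:
          groups:
            - name: rule-evaluation-capacity
              rules:
                - alert: RuleGroupEvaluationNearInterval
                  # Fires when a rule group spends more than 80% of its interval
                  # evaluating, i.e. before evaluations start to be skipped.
                  expr: |
                    prometheus_rule_group_last_duration_seconds
                      / prometheus_rule_group_interval_seconds > 0.8
                  for: 30m
                  labels:
                    severity: warning
                  annotations:
                    summary: Rule group {{ $labels.rule_group }} spends more than 80% of its interval on evaluation.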


            People

              Assignee: Unassigned
              Reporter: Jan Fajerski (jfajersk@redhat.com)
              Junqi Zhao
              Votes: 0
              Watchers: 5
