Observability Documentation / OBSDOCS-182

Insight into scalability bottlenecks

      This is the first effort to solve the linked feature request.

      Problem

      Our stack has several footguns: users can create resources that overload it and cause errors or unavailability.

      We have seen:

      • Queries that are too expensive, i.e. that take too long to run.
      • Alert rules that take a very long time to evaluate.
      • Alert rules that are repeated across many namespaces.

      We have guards in place for some of these. A long-running query will time out, though it will still cause significant load. Other problems produce strange, seemingly unrelated symptoms. We sometimes struggle first to pinpoint the cause and then to explain to users why they can't do what they did, when they (seemingly) followed supported procedures.

      The first step toward improving this is to gain easy insight into these bottlenecks and to be able to alert on them.

      https://github.com/thanos-io/thanos/pull/5741 is an example: it adds a new metric to thanos-querier that should be useful here.

      We need to identify and fill gaps in our metric set and then leverage these metrics to notify users before things break.
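
      As a rough illustration of the kind of check such metrics could enable, the sketch below flags rule groups whose last evaluation consumed most of their evaluation interval. It assumes the inputs were read from the Prometheus metrics prometheus_rule_group_last_duration_seconds and prometheus_rule_group_interval_seconds, keyed by rule group. The function name and the 0.8 threshold are hypothetical choices for this example, not part of any existing alert.

```python
from typing import Dict, List


def slow_rule_groups(
    last_duration_s: Dict[str, float],
    interval_s: Dict[str, float],
    threshold: float = 0.8,  # assumed cutoff, chosen only for illustration
) -> List[str]:
    """Return rule groups whose last evaluation took more than
    `threshold` of their configured evaluation interval."""
    flagged = []
    for group, duration in last_duration_s.items():
        interval = interval_s.get(group)
        if interval is None or interval <= 0:
            continue  # no usable interval for this group, so skip it
        if duration / interval > threshold:
            flagged.append(group)
    return sorted(flagged)


if __name__ == "__main__":
    # Hypothetical sample values, in seconds.
    durations = {"ns-a/rules": 25.0, "ns-b/rules": 2.0}
    intervals = {"ns-a/rules": 30.0, "ns-b/rules": 30.0}
    print(slow_rule_groups(durations, intervals))  # ['ns-a/rules']
```

      A group flagged this way is at risk of missing evaluations once its duration reaches the interval, so this ratio is one candidate signal for notifying users before things break.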

      Docs

      • Describe (new/improved) alerts relating to scalability
      • Document scalability expectations

            rhn-support-bburt Brian Burt