  1. OpenStack as Infra
  2. OSASINFRA-2366

Kuryr: Critical failure alerts


    • Kuryr: Critical failures alerting
    • False
    • False
    • Done
    • 0% To Do, 0% In Progress, 100% Done
    • Undefined

      Currently, when an intermittent failure occurs (such as a Neutron port never becoming ACTIVE or a load balancer being stuck in PENDING_UPDATE), Kuryr just logs the error ambiguously and fails a liveness probe. As a result, we receive a constant stream of bug reports about Kuryr containers being in CrashLoop, because users and support are unable to pinpoint that the problem lies with Neutron or Octavia.

      This epic is about creating a new Kuryr metric that reports the number of such errors instead. The errors should still be logged (in a way that clearly indicates an unrecoverable problem that Kuryr cannot fix itself) and reported in a metric that raises an alert if even one such problem persists. Once that is done, we should probably stop failing the liveness probe in these cases, since Kuryr itself works as expected.
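
      A minimal sketch of the idea, not Kuryr's actual implementation: an in-process counter that logs each unrecoverable upstream failure unambiguously and renders the totals in the Prometheus text exposition format. The class name, the metric name kuryr_critical_failures_total, and the component/reason labels are all hypothetical and only serve the illustration; it uses the standard library alone so it can be run as-is.

```python
import logging
from collections import Counter
from threading import Lock

LOG = logging.getLogger("kuryr_metric_sketch")


class CriticalFailureMetric:
    """Hypothetical counter of failures that Kuryr cannot fix itself."""

    def __init__(self, name="kuryr_critical_failures_total"):
        # The metric name is an assumption made for this sketch.
        self.name = name
        self._counts = Counter()
        self._lock = Lock()

    def report(self, component, reason):
        """Count one failure and log it so the faulty service is obvious."""
        with self._lock:
            self._counts[(component, reason)] += 1
        LOG.critical(
            "Unrecoverable %s failure that Kuryr cannot fix itself: %s",
            component, reason)

    def total(self):
        """Return the number of failures reported so far."""
        with self._lock:
            return sum(self._counts.values())

    def render(self):
        """Render the counter in the Prometheus text exposition format."""
        lines = [
            f"# HELP {self.name} Unrecoverable upstream failures.",
            f"# TYPE {self.name} counter",
        ]
        with self._lock:
            for (component, reason), value in sorted(self._counts.items()):
                lines.append(
                    f'{self.name}{{component="{component}",'
                    f'reason="{reason}"}} {value}')
        return "\n".join(lines) + "\n"


if __name__ == "__main__":
    metric = CriticalFailureMetric()
    # Report the failure instead of failing the liveness probe.
    metric.report("neutron", "port_not_active")
    metric.report("octavia", "lb_pending_update")
    print(metric.render(), end="")
```

      With something along these lines, an alerting rule could fire on any increase of the counter, while the liveness probe keeps reflecting only Kuryr's own health.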

          1. QE Tracker Sub-task Closed Undefined Unassigned
          2. Docs Tracker Sub-task Closed Undefined Unassigned
          3. TE Tracker Sub-task Closed Undefined Unassigned

              rdobosz Roman Dobosz
              mdulko Michał Dulko (Inactive)
              Votes: 0
              Watchers: 7
