-
Epic
-
Resolution: Done
-
Normal
-
None
-
Kuryr: Critical failures alerting
-
Done
-
0% To Do, 0% In Progress, 100% Done
Currently, when an intermittent failure occurs (such as a Neutron port never becoming ACTIVE or a load balancer being stuck in PENDING_UPDATE), Kuryr just logs the error in an ambiguous way and fails its liveness probe. As a result, we receive a constant stream of bug reports about Kuryr containers being in a CrashLoop, because users and support are unable to pinpoint that the problem is with Neutron or Octavia.
This epic is about creating a new Kuryr metric that reports the number of such errors instead. The errors should still be logged (in a way that clearly indicates an unrecoverable problem that Kuryr cannot fix itself) and reported in a metric that raises an alert if even one such problem persists. Once that is done, we should probably stop failing the liveness probe in these cases, since Kuryr itself is working as expected.
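
A minimal sketch of the idea, using only the Python standard library. It is not Kuryr's actual implementation: the metric name `kuryr_critical_failures_total`, the `CriticalFailureTracker` class and its methods are hypothetical names chosen for illustration. It shows the three behaviours described above: log the failure unambiguously, count it per external component, and leave liveness unaffected.

```python
import logging
from collections import Counter

LOG = logging.getLogger("kuryr_sketch")

# Hypothetical metric name; the epic does not specify one.
METRIC_NAME = "kuryr_critical_failures_total"


class CriticalFailureTracker:
    """Counts unrecoverable upstream failures instead of failing liveness."""

    def __init__(self):
        self._counts = Counter()

    def record(self, component, reason):
        # Log in a way that clearly names the failing external service.
        LOG.critical(
            "Unrecoverable %s failure that Kuryr cannot fix itself: %s",
            component, reason)
        self._counts[component] += 1

    def total(self):
        return sum(self._counts.values())

    def should_alert(self):
        # Assumption: alert as soon as at least one failure was recorded.
        return self.total() > 0

    def is_alive(self):
        # Kuryr itself works as expected, so liveness does not depend
        # on the recorded upstream failures.
        return True

    def render(self):
        """Render the counts in Prometheus text exposition format."""
        lines = [
            "# HELP %s Unrecoverable failures in external services."
            % METRIC_NAME,
            "# TYPE %s counter" % METRIC_NAME,
        ]
        for component, count in sorted(self._counts.items()):
            lines.append(
                '%s{component="%s"} %d' % (METRIC_NAME, component, count))
        return "\n".join(lines) + "\n"


tracker = CriticalFailureTracker()
tracker.record("neutron", "port never became ACTIVE")
tracker.record("octavia", "load balancer stuck in PENDING_UPDATE")
print(tracker.render())
```

In a real deployment the counter would more likely be exposed through a metrics library and the alert defined in the monitoring stack. Whether the alert should fire on any non-zero value or on a recent increase is a design decision this sketch does not settle.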