Task
Resolution: Done
Critical
MK - Sprint 233
WHAT
If openshift-router decides that it cannot reconfigure haproxy dynamically, it falls back to restarting the haproxy process, which drops all existing customer connections. We have configured the openshift-routers so that haproxy restarts should be rare. Let's add an alert to tell us if this is ever not the case.
WHY
haproxy config reloads disconnect established Kafka connections, which is disruptive to customer applications.
HOW
It appears that template_router_reload_seconds_count might be what we are looking for.
This is the count of observations, so it appears to correspond to the number of reloads.
increase(template_router_reload_seconds_count{job=~".*kas.*"}[5m]) > 0
We'd need to make sure we understand the behaviour of the counter:
- how does it behave on first startup
- how does it increment after that
Turning off the dynamicConfigManager will help us learn that.
We should then be able to add a new alert.
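A minimal sketch of what the alert and its unit test might look like, assuming the query above turns out to be the right signal. The file names, group name, alert name, severity, job and pod label values, and the sample counter values are all hypothetical placeholders, not taken from the existing rule set.

```yaml
# router-reload-rules.yaml (hypothetical file name)
groups:
  - name: kas-ingress-router-reloads        # hypothetical group name
    rules:
      - alert: KasIngressRouterHaproxyReload  # hypothetical alert name
        # Fires when the reload counter has grown within the last 5 minutes.
        expr: increase(template_router_reload_seconds_count{job=~".*kas.*"}[5m]) > 0
        labels:
          severity: warning                 # assumed severity
        annotations:
          summary: haproxy was reloaded by the ingress router
```

```yaml
# router-reload-rules-test.yaml (hypothetical file name), run with:
#   promtool test rules router-reload-rules-test.yaml
rule_files:
  - router-reload-rules.yaml
evaluation_interval: 1m
tests:
  - interval: 1m
    input_series:
      # Invented samples: the counter stays at 1 for minutes 0-10,
      # then steps to 2 (one reload) and stays there.
      - series: 'template_router_reload_seconds_count{job="kas-ingress-router", pod="router-0"}'
        values: '1+0x10 2+0x10'
    alert_rule_test:
      # No change in the counter within the window: the alert must stay silent.
      - eval_time: 5m
        alertname: KasIngressRouterHaproxyReload
        exp_alerts: []
      # The step from 1 to 2 falls inside the 5m window: the alert must fire.
      - eval_time: 12m
        alertname: KasIngressRouterHaproxyReload
        exp_alerts:
          - exp_labels:
              job: kas-ingress-router
              pod: router-0
              severity: warning
            exp_annotations:
              summary: haproxy was reloaded by the ingress router
```

The input series encodes an assumption about how the counter moves. Once the first-startup and increment behaviour has been observed on a real router, the sample values should be adjusted to match it.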
We also need an SOP. If the alert fires, restarting the ingress pods might help. If that fails, SRE should dial Engineering.
DONE
- Alert/unit test
- SOP written.
Relates to: MGDSTRM-9181 Ingress disconnects established connections whenever kafka instances are provisioned/deprovisioned (Closed)