  1. Managed Service - Streams
  2. MGDSTRM-10459

Alert on unexpected haproxy configuration reloads (that drop connections)


    • MK - Sprint 233

      WHAT

      If openshift-router decides that it cannot reconfigure haproxy dynamically, it resorts to restarting the haproxy process. This causes all existing customer connections to be dropped. We have configured openshift-routers in such a way that haproxy restarts should be rare. Let's have an alert to tell us if this is ever not the case.

      WHY

      haproxy config reloads disconnect established Kafka connections, which is disruptive to customer applications.

      HOW

      It appears that template_router_reload_seconds_count might be what we are looking for.
      This is the count of the number of observations, so it appears to correspond to the number of reloads.

      increase(template_router_reload_seconds_count{job=~".*kas.*"}[5m]) > 0

      We'd need to make sure we understand the behaviour of the counter:

      • how does it behave on first startup
      • how does it increment after that

      Turning off the dynamicConfigManager will help us learn that.

      We should then be able to add a new alert.
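
      A minimal sketch of what the alert rule could look like, assuming a standard Prometheus rule file. The group name, alert name, severity, and summary text are placeholders rather than agreed values. No `for:` clause is set, so a single observed reload would fire the alert:

      ```yaml
      groups:
        - name: kas-ingress-haproxy              # hypothetical group name
          rules:
            - alert: KasRouterHaproxyReloaded    # hypothetical alert name
              # True when at least one reload was observed in the last 5 minutes.
              expr: increase(template_router_reload_seconds_count{job=~".*kas.*"}[5m]) > 0
              labels:
                severity: warning                # assumed severity
              annotations:
                summary: "haproxy was reloaded on a kas router and existing connections may have been dropped"
      ```

      Since increase() needs at least two samples inside the window, a series that first appears at pod startup would probably not fire on its own. That is worth confirming as part of the counter investigation above.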

      We also need an SOP. If the alert fires, restarting the ingress pods might help. If that fails, SRE should escalate to Engineering.

      DONE

      • Alert/unit test
      • SOP written.
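
      For the unit test, a promtool rule test along these lines might be a starting point. It assumes a rule file named haproxy-reload-alert.yaml (hypothetical) that defines an alert called KasRouterHaproxyReloaded using the query from HOW, with a severity: warning label and exactly the summary annotation shown below. The job label value is also made up:

      ```yaml
      rule_files:
        - haproxy-reload-alert.yaml              # hypothetical rule file name
      evaluation_interval: 1m
      tests:
        - interval: 1m
          input_series:
            - series: 'template_router_reload_seconds_count{job="kas-ingress-router"}'
              # Counter stays at 1 for the samples at 0m..10m, then at 2 from 11m on (one reload).
              values: '1+0x10 2+0x10'
          alert_rule_test:
            # No reload yet: the alert must not fire.
            - eval_time: 5m
              alertname: KasRouterHaproxyReloaded
              exp_alerts: []
            # One reload inside the 5m window: the alert must fire.
            - eval_time: 12m
              alertname: KasRouterHaproxyReloaded
              exp_alerts:
                - exp_labels:
                    severity: warning
                    job: kas-ingress-router
                  exp_annotations:
                    summary: "haproxy was reloaded on a kas router and existing connections may have been dropped"
      ```

      This would be run with `promtool test rules <test-file>`. The expected labels and annotations have to match the real rule exactly, so they would need adjusting once the alert is finalised.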

              keithbwall Keith Wall
              Kafka Integrations