  1. Managed Service - Streams
  2. MGDSTRM-8296

Alert when at least one of the zookeeper pods is unready for X minutes


    • Type: Enhancement
    • Resolution: Done
    • Priority: Major
    • Sprint: MK - Sprint 231

      Overview
      This is a proposal to add a new alert, or alter the current one, so that it fires when one of the zookeeper pods is down for X minutes. Currently, the ZookeeperContainersDown alert only fires when more than one zookeeper pod is down for 10 minutes. This makes sense, since zookeeper is designed to be HA. However, in a recently observed incident, one zookeeper pod was down for at least 40 minutes before it was restarted; as a result, the strimzi-cluster-operator was affected and the StrimziKafkaStuck alert fired.

      The proposed alert will:
      1. Fire when a zookeeper pod is down for X minutes.
      2. Help us investigate why a zookeeper pod is down as soon as possible and prevent further compounding failures (e.g. the strimzi-cluster-operator case).
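      The difference between the current and the proposed firing conditions can be sketched in a few lines of Python. This is only an illustrative model, not the actual alert rule: the function names, the pod names, the sample timestamp, and the 15-minute value used for X are all assumptions. The current rule is approximated as "more than one pod unready for at least 10 minutes".

```python
from datetime import datetime, timedelta


def unready_pods_to_alert(unready_since, now, threshold):
    """Return the pods that have been continuously unready for at least `threshold`.

    `unready_since` maps a pod name to the time it became unready,
    or to None if the pod is currently ready.
    """
    return sorted(
        pod
        for pod, since in unready_since.items()
        if since is not None and now - since >= threshold
    )


def current_rule_fires(unready_since, now, threshold=timedelta(minutes=10)):
    # Approximation of the existing behaviour: more than one pod down for the window.
    return len(unready_pods_to_alert(unready_since, now, threshold)) > 1


def proposed_rule_fires(unready_since, now, threshold):
    # Proposed behaviour: at least one pod down for X minutes.
    return len(unready_pods_to_alert(unready_since, now, threshold)) >= 1


# Hypothetical snapshot resembling the incident: one pod unready for 40 minutes.
now = datetime(2024, 1, 1, 12, 0)
state = {
    "zookeeper-0": None,
    "zookeeper-1": now - timedelta(minutes=40),
    "zookeeper-2": None,
}

print(current_rule_fires(state, now))                           # one pod only, so no alert
print(proposed_rule_fires(state, now, timedelta(minutes=15)))   # assumed X = 15 minutes
```

      With a single pod unready for 40 minutes, the modelled current rule stays silent while the proposed rule fires for any X up to 40 minutes, which is the gap this issue describes.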

            agullon Alejandro Gullón
            jcueto@redhat.com Jose Cueto
            MK - Running the Service
