- Epic
- Resolution: Done
- Critical
- Expose Broker State
- In Progress
- MGDSRVS-48 - Be able to sustain an external paying customer in production
- 0% To Do, 0% In Progress, 100% Done
What
The BrokerState metric should be exposed to Prometheus and added to a dashboard so that the support team can understand the state of the broker.
Why
The broker state reveals the current internal state of the broker, which is important for understanding the state of the service. This is critical information for the SRE when diagnosing problems with the service. The possible states are:
- NOT_RUNNING (0): The state the broker is in when it first starts up.
- STARTING (1): The state the broker is in when it is catching up with cluster metadata.
- RECOVERY (2): The broker has caught up with cluster metadata, but has not yet been unfenced by the controller.
- RUNNING (3): The state the broker is in when it has registered at least once, and is accepting client requests.
- PENDING_CONTROLLED_SHUTDOWN (6): The state the broker is in when it is attempting to perform a controlled shutdown.
- SHUTTING_DOWN (7): The state the broker is in when it is shutting down.
- UNKNOWN (127): The broker is in an unknown state.
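Because the metric reports the state as a number, a dashboard or alert needs a mapping back to the state names. The following is a minimal Java sketch of that mapping, using only the values listed above. The class name `BrokerStateLookup`, the method `fromMetricValue`, and the choice to report any unlisted value (for example 4 or 5) as UNKNOWN are assumptions made for illustration, not part of Kafka or of this ticket.

```java
// Illustrative sketch only: the class and method names are hypothetical.
final class BrokerStateLookup {

    // State names and numeric values copied from the list above.
    enum State {
        NOT_RUNNING((byte) 0),
        STARTING((byte) 1),
        RECOVERY((byte) 2),
        RUNNING((byte) 3),
        PENDING_CONTROLLED_SHUTDOWN((byte) 6),
        SHUTTING_DOWN((byte) 7),
        UNKNOWN((byte) 127);

        private final byte value;

        State(byte value) {
            this.value = value;
        }

        byte value() {
            return value;
        }
    }

    // Prometheus samples are floating point, so the lookup accepts a double.
    // Assumption: any value not in the list is reported as UNKNOWN.
    static State fromMetricValue(double metricValue) {
        for (State state : State.values()) {
            if (state.value() == metricValue) {
                return state;
            }
        }
        return State.UNKNOWN;
    }

    public static void main(String[] args) {
        // Print the state name for a few sample metric values.
        for (double sample : new double[] {0, 3, 6, 4}) {
            System.out.println(sample + " -> " + fromMetricValue(sample));
        }
    }
}
```

On a dashboard the same mapping would more likely be expressed as value mappings on the panel, so that the SRE sees the state name rather than the raw number.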
How
- Expose the Kafka JMX mbean to Prometheus: https://github.com/bf2fc6cc711aee1a0c2a/kas-fleetshard/blob/main/operator/src/main/resources/kafka-metrics.yaml
- Have the metric remote-written to Central Observatorium: https://github.com/bf2fc6cc711aee1a0c2a/observability-resources-mk/blob/main/resources/prometheus/remote-write.yaml
- Expose the metric on the dashboard. Include sufficient context on the dashboard so that the SRE can understand what each state means.
- Once MGDSTRM-8173 is complete, that SOP should consider covering this metric to help the SRE understand the state of the service.
Done
- Metric exposed on the dashboard