-
Epic
-
Resolution: Unresolved
-
Undefined
-
None
-
None
-
Expose Log Recovery Metrics
-
True
-
Awaiting Kafka 3.3.0.
-
False
-
No
-
To Do
-
MGDSRVS-48 - Be able to sustain an external paying customer in production
WHAT
KIP-831 exposes log recovery metrics, which help support monitor log recovery progress, since recovery can take hours to complete. The service should surface these metrics on the support dashboard so that support users can better understand the state of a Kafka instance.
WHY
Log recovery runs when a broker starts up after an unclean shutdown; it checks that the logs are in a good state and not corrupted. If the broker stores a lot of log data, recovery can take hours or even days to complete. Today we have no way of knowing how far recovery is from completion. These metrics will let the support team track the progress of log recovery.
HOW
1. Expose the Kafka JMX mbean to Prometheus: https://github.com/bf2fc6cc711aee1a0c2a/kas-fleetshard/blob/main/operator/src/main/resources/kafka-metrics.yaml
2. Have the metrics remote-written to Central Observatorium: https://github.com/bf2fc6cc711aee1a0c2a/observability-resources-mk/blob/main/resources/prometheus/remote-write.yaml
3. Expose the metrics on the dashboard. Include sufficient context on the dashboard so that SRE can understand what the state means.
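As a rough illustration of step 1, the fragment below sketches what JMX Prometheus exporter rules for the two KIP-831 gauges might look like. The MBean names (remainingLogsToRecover tagged with dir, remainingSegmentsToRecover tagged with dir and threadNum) are taken from KIP-831 as we understand it and should be checked against the Kafka version actually deployed. The Prometheus metric names and label names are assumptions made for this sketch, not the names already used in kafka-metrics.yaml.

```yaml
# Sketch only: JMX exporter rules for the KIP-831 log recovery gauges.
# The output metric names and the thread_num label are illustrative assumptions.
rules:
  # Logs still waiting to be recovered, per log directory.
  - pattern: kafka.log<type=LogManager, name=remainingLogsToRecover, dir=(.+)><>Value
    name: kafka_log_logmanager_remaininglogstorecover
    type: GAUGE
    labels:
      dir: "$1"
  # Segments still waiting to be recovered, per log directory and recovery thread.
  - pattern: kafka.log<type=LogManager, name=remainingSegmentsToRecover, dir=(.+), threadNum=(.+)><>Value
    name: kafka_log_logmanager_remainingsegmentstorecover
    type: GAUGE
    labels:
      dir: "$1"
      thread_num: "$2"
```

Both gauges count down toward zero as recovery proceeds, so a dashboard panel would likely need a note that a non-zero value means recovery is still in progress for that directory.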
DONE
- Metrics exposed to support dashboard.
- is blocked by
-
MGDSTRM-9427 Upgrade RHOSAK service from Strimzi 0.29 to 0.32.0 / Kafka 3.2.3 to 3.3.1
- Closed
- is related to
-
MGDSTRM-9195 Reduce return to service time following abnormal broker shutdown
- Closed
- relates to
-
MGDSTRM-9024 Monitor produceridCount metrics and add alerts (Kafka 3.4.0)
- Backlog