[UI] Visualize storage per Kafka broker and storage limits per broker in Dashboard UI

Type: Epic
Priority: Critical
Status: To Do
Resolution: Done
Parent: MGDSRVS-170 - Improve Monitoring, Metrics and Observability capabilities to support Production Workloads
Progress: 67% To Do, 0% In Progress, 33% Done
Problem Statement
Our current "Used Disk Space" graph in our Dashboard UI shows the total disk space available across brokers. I.e. when you have a OpenShift Streams instance of 1 Streaming Unit (SU), which has a storage limit of 1 TB (pre-replication), the limit in that dashboard is visualized as 1TB, and the storage used is the sum of the storage used by all the Kafka brokers in your OpenShift Streams instance.
This model worked fine in the 1 SU model with a default replication factor of 3, as the data, and thus the used storage per broker, was evenly distributed across the 3 brokers. As a result, all brokers would approach their storage limit at the same time.
With the introduction of 2 Streaming Units as part of MGDSRVS-43 this changes: in a 6-broker Kafka cluster, even distribution of data across the Kafka brokers is no longer guaranteed. A number of heavily used partitions containing a lot of data could be placed on the same broker, so a single broker, or a subset of brokers, can run out of storage first, far ahead of the others. When one of the brokers starts to run out of storage space, all producers get throttled, to the point where they can no longer produce messages. When this happens, because we display the used storage as the sum across all brokers and show the limit of the entire cluster, the user sees that the used storage is nowhere near the limit, yet their producers are throttled. This makes it hard for the user to understand the situation and resolve it.
Our Prometheus metrics do expose both the per-broker storage limit and the per-broker used storage:
{ "metric": { "__name__": "kafka_broker_quota_hardlimitbytes", "broker_id": "0", "statefulset_kubernetes_io_pod_name": "samurai-pizza-kafkas-kafka-0", "strimzi_io_cluster": "samurai-pizza-kafkas" }, "timestamp": 1657637981408, "value": 357913941333 }, { "metric": { "__name__": "kafka_broker_quota_hardlimitbytes", "broker_id": "1", "statefulset_kubernetes_io_pod_name": "samurai-pizza-kafkas-kafka-1", "strimzi_io_cluster": "samurai-pizza-kafkas" }, "timestamp": 1657637981408, "value": 357913941333 }, { "metric": { "__name__": "kafka_broker_quota_hardlimitbytes", "broker_id": "2", "statefulset_kubernetes_io_pod_name": "samurai-pizza-kafkas-kafka-2", "strimzi_io_cluster": "samurai-pizza-kafkas" }, "timestamp": 1657637981408, "value": 357913941333 }, ...... { "metric": { "__name__": "kubelet_volume_stats_used_bytes", "persistentvolumeclaim": "data-0-samurai-pizza-kafkas-kafka-2" }, "timestamp": 1657637981627, "value": 72806400 }, { "metric": { "__name__": "kubelet_volume_stats_used_bytes", "persistentvolumeclaim": "data-0-samurai-pizza-kafkas-kafka-1" }, "timestamp": 1657637981627, "value": 72835072 }, { "metric": { "__name__": "kubelet_volume_stats_used_bytes", "persistentvolumeclaim": "data-0-samurai-pizza-kafkas-kafka-0" }, "timestamp": 1657637981627, "value": 72847360 },
So the user would be able to analyse and understand the situation using our Prometheus metrics. But this requires the user to be consuming these Prometheus metrics, and to be familiar with them.
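For illustration, a user consuming these metrics could join the two series to see each broker's usage as a fraction of its hard limit. A minimal PromQL sketch, assuming the PVC naming convention shown above (label_replace derives a broker_id label from the persistentvolumeclaim name so the two metrics can be matched):

# Fraction of the per-broker hard limit currently used by each broker
label_replace(
  kubelet_volume_stats_used_bytes{persistentvolumeclaim=~"data-0-samurai-pizza-kafkas-kafka-[0-9]+"},
  "broker_id", "$1",
  "persistentvolumeclaim", "data-0-samurai-pizza-kafkas-kafka-([0-9]+)"
)
/ on(broker_id)
kafka_broker_quota_hardlimitbytes{strimzi_io_cluster="samurai-pizza-kafkas"}

A broker whose ratio approaches 1 is the one triggering the throttling, even when the cluster-wide sum looks healthy.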
We make it unnecessarily hard for users to analyse and understand these situations via our Dashboard UI. Since the idea behind the Dashboard is to surface the limits of our service and help users operate their service instance within those limits, the Dashboard does not fulfil this purpose for larger Kafka instances (i.e. 2 Streaming Units).
Narrative
Shaun is an Ops engineer at FoodRacers, an online meal delivery service. FoodRacers has been happily using OpenShift Streams in production for the last couple of months. With the growing popularity of their delivery service, they have been outgrowing their 1 Streaming Unit OpenShift Streams instance. In collaboration with Red Hat Services, they've decided to move to a larger Kafka cluster, and have deployed a 2 Streaming Unit OpenShift Streams instance.
At some point, during what seems to be normal operations and business as usual, message production to their OpenShift Streams instance slows down, and message rates drop below expected rates. Shaun knows about the OpenShift Streams service limits, and suspects the problem is due to Kafka producers being throttled as the brokers run out of storage space. Shaun opens the Dashboard in the OpenShift Streams UI in the Red Hat Hybrid Cloud Console. He checks the "Used Storage" graph and sees that the used storage of his OpenShift Streams instance is still far below the limit. Instead of the total used storage, he then checks the graph that shows the storage limits and used storage per broker, where he can clearly see that broker 4 is approaching its storage limit.
Shaun quickly analyses the partition distribution across the brokers and sees that 3 of the largest partitions are all assigned to broker 4. He decides to reassign these partitions to other brokers to solve the immediate problem. The next day he has a meeting with the full DevOps team to discuss a long-term solution. They decide that increasing the number of partitions on the given topics, combined with a better partitioning strategy, should solve the problem.
Proposed Solution
The proposed solution is to display both the storage limit and the used storage per broker. There are a number of ways we can do this:
- Replace the current information in the "Used Storage" graph with per-broker storage information. This would mean no longer showing the cluster-wide storage information, i.e. the total limit and total used storage. Since the total used storage is the metric on which customers are billed, this solution seems to introduce new problems.
- Add a selection menu, filter or tab to the current graph that allows the user to drill down from the overall used storage and limit to a per-broker used storage and storage limits graph. A filter would allow the user to select only the brokers they are interested in, which might make it easier for the user to properly analyse the situation.
- Add an additional panel to the dashboard, next to the existing "Used Disk Space" panel, that displays the storage limits and used storage per broker. In this case too, being able to filter out certain brokers can be useful, especially when we move to even larger Kafka instances (3, 4 or 5 Streaming Units) in the future; see the sketch after this list.
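In any of these options, a broker filter in the UI could translate directly into a Prometheus label matcher on the underlying queries. A minimal sketch, assuming the user has selected brokers 3 and 4 and the naming conventions shown in the Problem Statement:

# Per-broker hard limit, restricted to the brokers selected in the UI filter
kafka_broker_quota_hardlimitbytes{broker_id=~"3|4"}

# Per-broker used storage for the same brokers, selected via the PVC name
kubelet_volume_stats_used_bytes{persistentvolumeclaim=~"data-0-samurai-pizza-kafkas-kafka-(3|4)"}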
Acceptance Criteria
- The user is able to view the storage limits and the used storage per broker in the Dashboard UI of OpenShift Streams for Apache Kafka.
- The user is able to filter the brokers in this view, allowing them to view only the information for the brokers they are interested in, helping them to better analyse the problem.