Story
Resolution: Done
Our metrics story is a bloody mess today.
Let's start with where we would like to go, and work our way backwards to figure out how to get there (if we even can).
Observation vs. Alerting
We need metrics for two main purposes:
1. To gather baseline data, look at trends, that kind of thing. Canonical example: how long does resume-from-hibernation "normally" take? So we can e.g. determine a P99 for an SLO. For this, we want to "observe" the metric every time we come across it.
2. To inform alerting and SLA breach detection. Here we want the Alertmanager syntax to be able to decide when a value is outside of the "expected" range – that range typically calculated based on #1; see the sketch below.
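
As a rough sketch of the two flavors (the metric name, buckets, and threshold below are invented for illustration, not Hive's actual metrics), #1 is an always-on histogram observed on every resume, and #2 is an alert expression evaluated over that recorded data:

package metrics

import (
    "time"

    "github.com/prometheus/client_golang/prometheus"
)

// Always-on histogram for purpose #1: observed every time a cluster resumes,
// no per-cluster labels, so the series count stays flat regardless of fleet size.
var resumeDurationSeconds = prometheus.NewHistogram(prometheus.HistogramOpts{
    Name:    "hive_cluster_resume_duration_seconds", // hypothetical name
    Help:    "Time taken for a cluster to resume from hibernation.",
    Buckets: prometheus.ExponentialBuckets(30, 2, 10), // 30s up to ~4.25h
})

func init() {
    prometheus.MustRegister(resumeDurationSeconds)
}

// ObserveResume records one completed resume (purpose #1).
func ObserveResume(d time.Duration) {
    resumeDurationSeconds.Observe(d.Seconds())
}

// Purpose #2 is then a PromQL alert over the recorded data, e.g. (illustrative):
//   histogram_quantile(0.99,
//     sum(rate(hive_cluster_resume_duration_seconds_bucket[1h])) by (le)) > 1800
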
If it weren't for SRE, none of this would be particularly complicated. However...
"Which ClusterDeployment?" and the Cardinality Concern
SRE-P deals with thousands of clusters. When they get an alert like "at least one cluster is taking too long to resume from hibernation", they are going to need to know which clusters are affected. Absent some cluster-specific indicator on the metric, they have to log in (to several shards maybe?) and run bespoke queries to find them.
So for these cases, we want to include a label that uniquely identifies the cluster. However, this blows up Prometheus cardinality (won't restate the reasons here), so it's not reasonable or practical to do by default / across all metrics.
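
To make the cardinality concern concrete, the labeled variant would look something like the sketch below (names invented). Every distinct label value combination is a new set of time series, so a fleet of thousands of clusters multiplies the series count for this one metric by thousands:

package metrics

import (
    "time"

    "github.com/prometheus/client_golang/prometheus"
)

// Labeled variant: a cluster-identifying label pair. Every distinct
// (namespace, cluster_deployment) value creates a fresh time series per
// histogram bucket, so thousands of clusters times ~10 buckets is tens of
// thousands of series for a single metric. That is why we can't do this
// by default across all metrics.
var resumeDurationByCluster = prometheus.NewHistogramVec(prometheus.HistogramOpts{
    Name:    "hive_cluster_resume_duration_seconds", // hypothetical name
    Help:    "Time taken for a cluster to resume from hibernation.",
    Buckets: prometheus.ExponentialBuckets(30, 2, 10),
}, []string{"namespace", "cluster_deployment"})

func observeResumeFor(namespace, cdName string, d time.Duration) {
    resumeDurationByCluster.WithLabelValues(namespace, cdName).Observe(d.Seconds())
}
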
Solution: Duration-Based Optional Reporting
One solution to the cardinality concern, specific to metrics that measure a duration ("how long did X take?" or, while X is still in progress, "how long is X taking?"), has been to include a cluster-identifying label but make the metric not report at all by default. To get it reporting, the user has to add configuration with a threshold value: we only report the metric if the duration exceeds that value (the P99 or whatever).
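
A minimal sketch of that pattern, assuming a hypothetical threshold setting plumbed in from HiveConfig (illustrative, not the actual Hive implementation):

package metrics

import (
    "time"

    "github.com/prometheus/client_golang/prometheus"
)

// Optional, cluster-labeled duration metric: silent by default, only observed
// when the measured duration crosses a user-configured threshold.
var slowResumeSeconds = prometheus.NewHistogramVec(prometheus.HistogramOpts{
    Name: "hive_cluster_resume_duration_seconds_over_threshold", // hypothetical
    Help: "Resume durations that exceeded the configured reporting threshold.",
}, []string{"cluster_deployment", "namespace"})

// minimumReportDuration would come from HiveConfig; zero means "never report",
// which is the default when the user has not opted in.
var minimumReportDuration time.Duration

func observeResumeIfSlow(namespace, cdName string, d time.Duration) {
    if minimumReportDuration == 0 || d < minimumReportDuration {
        return // below the threshold, or feature not configured: report nothing
    }
    slowResumeSeconds.WithLabelValues(cdName, namespace).Observe(d.Seconds())
}
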
Solution: Optional Labels
Another way to address cardinality concerns is to put the burden fully on the user. Support configuration such that the user can dynamically request labels – and then they can use labels that uniquely identify the cluster if they wish.
Note that our current solution for this is across the board: if you configure AdditionalClusterDeploymentLabels, they will apply to all metrics that support optional labels. I.e. you can't set up different labels for different metrics. This is already a problem, and it's only going to get worse.
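
Roughly, the current behavior amounts to the sketch below (the resolution logic and example keys are approximations, not the real code): one global map, resolved per ClusterDeployment and attached to every optional-label metric.

package metrics

import "github.com/prometheus/client_golang/prometheus"

// Sketch of today's global knob: one HiveConfig map from metric label name to
// ClusterDeployment label key, applied to every metric that supports optional
// labels. There is no way to give different metrics different label sets.
type metricsConfig struct {
    // e.g. {"cluster_type": "hive.example.com/cluster-type"} (invented keys)
    AdditionalClusterDeploymentLabels map[string]string
}

// buildOptionalLabels resolves the configured optional labels from one
// ClusterDeployment's labels. The same resolved set is attached to every
// optional-label metric, which is exactly the limitation noted above.
func buildOptionalLabels(cfg metricsConfig, cdLabels map[string]string) prometheus.Labels {
    out := prometheus.Labels{}
    for metricLabel, cdLabelKey := range cfg.AdditionalClusterDeploymentLabels {
        out[metricLabel] = cdLabels[cdLabelKey]
    }
    return out
}
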
The Fairy Tale
It would be cool if we had duration metrics in pairs.
One of the pair always reports, and has minimal labels. It is used to discover baselines/trends.
The other is optional, configurable with a threshold and with optional labels. And the configurability is per metric. This is used for alerting.
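
A sketch of what such a pair could look like, with the configuration scoped per metric (all type and field names are invented to illustrate the shape, not a proposed API):

package metrics

import (
    "time"

    "github.com/prometheus/client_golang/prometheus"
)

// DurationMetricConfig is the per-metric knob the fairy tale wants: each
// duration metric pair gets its own threshold and its own optional labels.
type DurationMetricConfig struct {
    MinimumReportDuration time.Duration     // 0 = alerting half of the pair stays silent
    AdditionalLabels      map[string]string // metric label name -> ClusterDeployment label key
}

// durationMetricPair couples the always-on baseline metric with the optional,
// threshold-gated, extra-labeled alerting metric.
type durationMetricPair struct {
    baseline prometheus.Histogram     // always reports, minimal labels: baselines/trends
    alerting *prometheus.HistogramVec // built with the label names from cfg.AdditionalLabels
    cfg      DurationMetricConfig
}

func (p *durationMetricPair) Observe(d time.Duration, cdLabels map[string]string) {
    // The baseline half reports unconditionally.
    p.baseline.Observe(d.Seconds())

    // The alerting half only reports when configured and over threshold.
    if p.cfg.MinimumReportDuration == 0 || d < p.cfg.MinimumReportDuration {
        return
    }
    labels := prometheus.Labels{}
    for metricLabel, cdLabelKey := range p.cfg.AdditionalLabels {
        labels[metricLabel] = cdLabels[cdLabelKey]
    }
    p.alerting.With(labels).Observe(d.Seconds())
}
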
Backward Compatibility
Procedurally, we can't just go changing this willy-nilly.
We have consumers (notably OSD) relying on existing metrics, including their spelling, the existence and spelling of their label keys, and their configuration knobs in HiveConfig.
We might be able to get away with changing the spelling of metrics and labels, in careful coordination with OSD (alerting rules in app-interface).
Changing the HiveConfig API is another matter.
I'll stop here for now. This needs lots of careful thought and design.
is depended on by: HIVE-2285 Identify clusters that are in limited support on Hive clusters (Closed)