Type: Story
Resolution: Done
Priority: Normal
Fix Version/s: None
Affects Version/s: None
Labels:
- TechDebt
- no-qe

Activity Type: Future Sustainability
Blocked: False
Blocked Reason: None
Ready: False
Epic Link: Metrics enhanced
Story Points: None
Target Version: None
Release Blocker: None
Sprint: None

Our metrics story is a bloody mess today.

Let's start with where we would like to go, and work our way backwards to figure out how to get there (if we even can).

Observation vs. Alerting

We need metrics for two main purposes:
1. To gather baseline data and look at trends. Canonical example: how long does resume-from-hibernation "normally" take? This lets us, e.g., determine a P99 for an SLO. For this, we want to "observe" the metric every time we come across it.
2. To inform alerting and SLA breach detection. Here we want the alerting rules to be able to decide when a value is outside of the "expected" range, with that range typically calculated based on #1.
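
To make the relationship between the two purposes concrete, here is a minimal sketch: derive a nearest-rank P99 from observed durations (purpose 1), then use it as the threshold for an out-of-range check (purpose 2). The sample values and function names are made up for illustration; in a real deployment this would more likely be a quantile query over a histogram than application code.

```python
import math

def nearest_rank_percentile(samples, pct):
    """Return the nearest-rank percentile (0 < pct <= 100) of samples."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    rank = math.ceil(pct / 100 * len(ordered))  # 1-based rank into the sorted list
    return ordered[rank - 1]

def breaches(duration_seconds, threshold_seconds):
    """Alerting-side check: is this observation outside the expected range?"""
    return duration_seconds > threshold_seconds

# Hypothetical resume-from-hibernation durations, in seconds.
baseline_seconds = [240, 250, 255, 260, 270, 275, 280, 300, 310, 900]

p50 = nearest_rank_percentile(baseline_seconds, 50)
p99 = nearest_rank_percentile(baseline_seconds, 99)
```

Note that the baseline needs every observation, while the breach check only cares about the ones past the threshold. That asymmetry is what the rest of this ticket is about.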

If it weren't for SRE, none of this would be particularly complicated. However...

"Which ClusterDeployment?" and the Cardinality Concern

SRE-P deals with thousands of clusters. When they get an alert like "at least one cluster is taking too long to resume from hibernation", they are going to need to know which clusters are affected. Absent some cluster-specific indicator on the metric, they have to log in (to several shards maybe?) and run bespoke queries to find them.

So for these cases, we want to include a label that uniquely identifies the cluster. However, this blows up Prometheus cardinality (won't restate the reasons here), so it's not reasonable or practical to do by default or across all metrics.
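
Without restating the reasons, a rough sketch of the scale of the problem: each distinct combination of label values is its own time series, so a per-cluster label makes the series count for a metric grow with the number of clusters. The fleet size, label names, and label values below are invented for illustration.

```python
def distinct_series(observations, label_keys):
    """Count distinct label-value combinations (i.e. time series) for one metric."""
    return len({tuple(obs[k] for k in label_keys) for obs in observations})

# Hypothetical fleet: 3000 clusters spread over 3 platforms and 2 cluster types.
platforms = ["aws", "gcp", "azure"]
cluster_types = ["managed", "unmanaged"]
observations = [
    {
        "cluster_deployment": f"ns-{i}/cluster-{i}",
        "platform": platforms[i % 3],
        "cluster_type": cluster_types[i % 2],
    }
    for i in range(3000)
]

without_cluster = distinct_series(observations, ["platform", "cluster_type"])
with_cluster = distinct_series(
    observations, ["platform", "cluster_type", "cluster_deployment"]
)
```

In this toy fleet the same metric goes from 6 series to 3000 once the cluster-identifying label is added.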

Solution: Duration-Based Optional Reporting

One solution to the cardinality concern, specific to metrics that measure a duration ("how long did X take?" or, while it's still in progress, "how long is X taking?"), has been to include a cluster-identifying label but make the metric not report at all by default. To get it reporting, the user has to add configuration with a threshold value: we'll only report the metric if it exceeds that value (the P99 or whatever).
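
A minimal sketch of that gating logic, assuming the threshold is either unset or a number of seconds. The function names and sample data are hypothetical and are not the actual Hive implementation.

```python
from typing import Dict, Optional

def should_report(duration_seconds: float, threshold_seconds: Optional[float]) -> bool:
    """Report only if a threshold is configured and the duration exceeds it."""
    if threshold_seconds is None:
        return False  # not configured: the metric stays silent by default
    return duration_seconds > threshold_seconds

def collect(durations_by_cluster: Dict[str, float],
            threshold_seconds: Optional[float]) -> Dict[str, float]:
    """Return {cluster: duration} for the clusters that would be reported."""
    return {
        cluster: duration
        for cluster, duration in durations_by_cluster.items()
        if should_report(duration, threshold_seconds)
    }

# Hypothetical in-progress resume durations, in seconds.
resuming = {
    "ns-a/cluster-a": 200.0,
    "ns-b/cluster-b": 1900.0,
    "ns-c/cluster-c": 400.0,
}

unconfigured = collect(resuming, None)
configured = collect(resuming, 900.0)
```

With no threshold nothing is reported, so cardinality is zero. With a threshold of 900 seconds only the one slow cluster produces a series, which keeps the cluster label affordable but also means this metric is useless for baselines.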

Solution: Optional Labels

Another way to address cardinality concerns is to put the burden fully on the user: support configuration that lets the user dynamically request labels, so they can use labels that uniquely identify the cluster if they wish.

Note that our current solution for this is across the board: if you configure AdditionalClusterDeploymentLabels, they will apply to all metrics that support optional labels. I.e. you can't set up different labels for different metrics. This is already a problem, and it's only going to get worse.

The Fairy Tale

It would be cool if we had duration metrics in pairs.
One of the pair always reports, and has minimal labels. It is used to discover baselines/trends.
The other is optional, configurable with a threshold and with optional labels. And the configurability is per metric. This is used for alerting.
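
As a sketch of what such a pair could look like, assuming a per-metric config object that carries both the threshold and the optional labels. Every name here (AlertingConfig, emit_pair, the "_over_threshold" suffix, the label keys) is hypothetical and only meant to illustrate the shape of the idea.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple

@dataclass
class AlertingConfig:
    """Hypothetical per-metric settings for the optional half of a pair."""
    threshold_seconds: float
    extra_labels: List[str] = field(default_factory=list)

Sample = Tuple[str, Dict[str, str], float]  # (metric name, labels, value)

def emit_pair(name: str, duration_seconds: float, cluster_labels: Dict[str, str],
              alerting: Optional[AlertingConfig]) -> List[Sample]:
    # Baseline half: always reported, minimal labels only.
    samples: List[Sample] = [
        (name, {"platform": cluster_labels["platform"]}, duration_seconds)
    ]
    # Alerting half: only when configured and over this metric's own threshold.
    if alerting is not None and duration_seconds > alerting.threshold_seconds:
        labels = {key: cluster_labels[key] for key in alerting.extra_labels}
        samples.append((name + "_over_threshold", labels, duration_seconds))
    return samples

cluster = {"platform": "aws", "cluster_deployment": "ns-a/cluster-a"}
cfg = AlertingConfig(threshold_seconds=900.0, extra_labels=["cluster_deployment"])

fast = emit_pair("resume_duration_seconds", 300.0, cluster, cfg)
slow = emit_pair("resume_duration_seconds", 1200.0, cluster, cfg)
unconfigured = emit_pair("resume_duration_seconds", 1200.0, cluster, None)
```

Because the config hangs off the individual metric, a different metric could use a different threshold and a different label set, which is exactly what the across-the-board AdditionalClusterDeploymentLabels can't express today.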

Backward Compatibility

Procedurally, we can't just go changing this willy-nilly.
We have consumers (notably OSD) relying on existing metrics, including their spelling, the existence and spelling of their label keys, and their configuration knobs in HiveConfig.
We might be able to get away with changing the spelling of metrics and labels, in careful coordination with OSD (alerting rules in app-interface).
Changing the HiveConfig API is another matter.

I'll stop here for now. This needs lots of careful thought and design.

is depended on by

HIVE-2285 Identify clusters that are in limited support on Hive clusters

Closed

links to

openshift/hive#2182: Enhancement for metricsConfig redesign

Assignee: Suhani Mehta
Reporter: Eric Fried
Need Info From: None
Contributors: None
QA Contact: None
Doc Contact: None
Votes: 0
Watchers: 5
Created: 2023/10/26 9:47 PM
Updated: 2025/07/04 1:15 PM
Resolved: 2024/05/16 3:57 PM
