Type: Bug
Resolution: Done
Priority: Critical
Fix Version/s: 2024-03-06 - API
Affects Version/s: None
Component/s: None
Labels:
- automated
- qe
- refined

Blocked:
False
Blocked Reason: None
Ready:
True
Epic Link:
SWATCH-1602
Feature Link:
BIZ-629 - ELS add on for concurrent (non-pay-as-you-go) RHEL offerings
Intelligence Requested:
Market:

SFDC Cases Links:
SFDC Cases Open:
SFDC Cases Counter:

Metrics gathering for rhel-for-x86-els-payg is still capturing duplicate data, resulting in a double-billing scenario. We thought we had already solved this in SWATCH-2109, but that card resolved only one specific use case, and this turns out to be a more pervasive problem.

After a lot of collaboration/investigation with the RHOBS team, it's become evident that we've assumed we can push a lot of business logic off to Prometheus by hacking together some complex PromQL to do core-hour calculations. This has worked out for us for metering OpenShift, because of the cardinality of clusters pushing telemetry. With rhelemeter, this is going to be a problem. These calculations are expensive and time consuming, which makes it hard to predict when a recording rule will finish evaluating and make those core-hour metrics available. The issue is compounded by the nature of Prometheus: it's meant for monitoring and alerting, so it prefers reliability over accuracy.

The RHOBS team has some achievable and realistic ideas for queries and recording rules that would give us the average core-hours of an instance over an hour. Assuming the billing-related labels for an instance aren't changing over that hour, this would be perfect if we're only talking about getting data for billing purposes.
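To make the "average core-hours over an hour" idea concrete, here is a minimal sketch in Python. It is not the RHOBS team's query or recording rule; the function name, the 5-minute sample spacing, and the sample values are all assumptions made up for illustration. It only shows the arithmetic: average the sampled core counts for the hour and scale by the window length.

```python
from statistics import fmean


def average_core_hours(core_samples, window_hours=1.0):
    """Average the sampled core counts and scale by the window length.

    Hypothetical helper: assumes samples are evenly spaced across the window.
    """
    if not core_samples:
        return 0.0
    return fmean(core_samples) * window_hours


# Assumed data: twelve 5-minute samples, 4 cores for the first half hour, 2 after.
samples = [4.0] * 6 + [2.0] * 6
print(average_core_hours(samples))  # 3.0
```

This only holds up for billing if the billing-related labels stay constant for the whole hour, which is the assumption stated above.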

The challenge is that swatch also uses this data to show users more fine-grained details about that instance. Was SLA=Premium? Was Usage=Production? Imagine an instance that reported using 1 core-hour. At the beginning of the hour, SLA=Premium. Then 30 minutes later, the instance is reporting SLA=Standard. If a user is looking at the usage graph with no filters, it should show 1 core-hour. If a user filters by SLA=Premium, what should it show? 1 core-hour, because that was the overall usage for the instance associated with SLA=Premium, or 0.5 core-hour, to reflect that instance/SLA combination? Are we reporting/billing/showing usage based on the instance holistically over the hour timeframe, which would treat SLA/Usage almost as "tags"?
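The two possible answers to the SLA=Premium question can be sketched as follows. This is purely illustrative Python, not swatch code: the sample layout, the function names, and the 5-minute spacing are assumptions. It reproduces the example of an instance using 1 core for the whole hour, with the SLA label changing at the 30-minute mark.

```python
# Assumed samples: (minute offset within the hour, SLA label, cores in use).
samples = [(m, "Premium" if m < 30 else "Standard", 1.0) for m in range(0, 60, 5)]


def holistic_usage(samples, sla_filter=None):
    """Instance-level view: the whole hour counts for any SLA seen in that hour."""
    if not samples:
        return 0.0
    total = sum(cores for _, _, cores in samples) / len(samples)
    if sla_filter is None or any(sla == sla_filter for _, sla, _ in samples):
        return total
    return 0.0


def split_usage(samples, sla_filter=None):
    """Per-combination view: each sample counts only toward its own SLA."""
    if not samples:
        return 0.0
    matching = sum(
        cores for _, sla, cores in samples if sla_filter is None or sla == sla_filter
    )
    return matching / len(samples)


print(holistic_usage(samples))             # 1.0
print(holistic_usage(samples, "Premium"))  # 1.0
print(split_usage(samples, "Premium"))     # 0.5
```

Both views agree when no filter is applied. They diverge only once a label changes within the hour, which is exactly the case a single hourly value cannot represent.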

Our current architecture expects all of that nuance to have been handled during hourly metric gathering, which in turn expects it to be handled by our Prometheus query, which is expected to return one metric with one value for every hour. This gets very complicated (and possibly not even feasible) when we start taking into consideration all the other labels that might change over the hour.

We need to collaborate with the RHOBS team to figure out how much is feasible/appropriate to offload to PromQL and recording rules, and then identify what changes need to be made on the swatch side, while keeping it backwards compatible with fetching metrics from OpenShift telemeter.

mentioned on

Merge request - SWATCH-2153 - Modify rhelemeter template

Assignee: Kartik Shah
Reporter: Lindsey Burnett
QA Contact: Nikhil Kathole
Votes: 0
Watchers: 3
Created: 2024/01/26 5:48 PM
Updated: 2024/05/23 8:37 AM
Resolved: 2024/02/12 1:14 PM
