Subscription Watch / SWATCH-2153

Metrics gathering for rhel-for-x86-els-payg still capturing duplicate data


Details

    • Bug
    • Resolution: Done
    • Critical
    • 2024-03-06 - API
    • BIZ-629 - ELS add on for concurrent (non-pay-as-you-go) RHEL offerings

    Description

      Metrics gathering for rhel-for-x86-els-payg is still capturing duplicate data, resulting in a double-billing scenario. We thought we had already solved this in SWATCH-2109, but that card resolved only one specific use case, and this turns out to be a more pervasive problem.

      After a lot of collaboration and investigation with the RHOBS team, it has become evident that we assumed we could push a lot of business logic off to Prometheus by hacking together complex PromQL to do core-hour calculations. This has worked for metering OpenShift because of the cardinality of clusters pushing telemetry. With rhelemeter, it is going to be a problem. These calculations are expensive and time-consuming, which makes it hard to predict when a recording rule will finish evaluating and make those core-hour metrics available. The issue is compounded by the nature of Prometheus: it is meant for monitoring and alerting, so it prefers reliability over accuracy.

      The RHOBS team has some achievable, realistic ideas for queries and recording rules that would give us the average core hours of an instance over an hour. Assuming the billing-related labels for an instance don't change over that hour, this would be perfect if we only needed the data for billing purposes.
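
      To make "average core hours of an instance over an hour" concrete, here is a minimal sketch. It assumes the instance reports a core-count gauge at regular intervals and that a plain mean of the samples in the window is acceptable, which is roughly what PromQL's avg_over_time computes. The function name and the sample data are hypothetical, not taken from the actual queries or recording rules.

```python
def average_core_hours(samples, window_hours=1.0):
    """Return core-hours for one instance over a window.

    samples: core-count gauge readings taken at regular intervals
    inside the window (an assumption of this sketch).
    """
    if not samples:
        return 0.0
    average_cores = sum(samples) / len(samples)
    return average_cores * window_hours


# Hypothetical instance reporting every 5 minutes:
# 4 cores for the first half hour, 2 cores for the second.
samples = [4] * 6 + [2] * 6
core_hours = average_core_hours(samples)
print(core_hours)
```

      This yields a single value per instance per hour, which is only unambiguous while the billing-related labels stay constant during that hour.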

      The challenge is that swatch also uses this data to show users more fine-grained details about that instance. Was SLA=Premium? Was Usage=Production? Imagine an instance reported using 1 core-hour. At the beginning of the hour, SLA=Premium. Then, 30 minutes later, the instance reports SLA=Standard. If a user looks at the usage graph with no filters, it should show 1 core-hour. If a user filters by SLA=Premium, what should it show? 1 core-hour, because that was the overall usage for the instance associated with SLA=Premium, or 0.5 core-hour, to reflect that instance/SLA combination? Are we reporting, billing, and showing usage based on the instance holistically over the hour, which would treat SLA/Usage almost as "tags"?
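
      The two readings of the SLA=Premium example can be worked through in a small sketch. The observation tuples and function names below are hypothetical and only illustrate the arithmetic. They do not reflect how swatch currently stores or tallies usage.

```python
from collections import defaultdict

# Hypothetical observations for one instance over one hour:
# (fraction_of_hour, cores, sla)
observations = [
    (0.5, 1.0, "Premium"),
    (0.5, 1.0, "Standard"),
]


def total_core_hours(obs):
    """Unfiltered usage for the instance over the hour."""
    return sum(fraction * cores for fraction, cores, _ in obs)


def proportional_by_sla(obs):
    """Split usage by how long each SLA value was reported."""
    usage = defaultdict(float)
    for fraction, cores, sla in obs:
        usage[sla] += fraction * cores
    return dict(usage)


def holistic_by_sla(obs):
    """Treat SLA as a tag: every SLA seen gets the full hourly usage."""
    total = total_core_hours(obs)
    return {sla: total for _, _, sla in obs}


print(total_core_hours(observations))
print(proportional_by_sla(observations))
print(holistic_by_sla(observations))
```

      In this example the proportional reading gives 0.5 core-hour per SLA value, which sums to the unfiltered total of 1 core-hour. The holistic reading gives 1 core-hour under each SLA value, so adding up the filtered views would exceed the unfiltered total.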

      Our current architecture expects all of that nuance to have been handled during hourly metric gathering, which in turn expects it to have been handled by our Prometheus query, which is expected to return one metric with one value for every hour. This gets very complicated (and possibly not even feasible) once we take into consideration all the other labels that might change over the hour.
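
      As an illustration of the "one metric with one value for every hour" expectation, the sketch below flags instance/hour pairs for which more than one series came back, for example because a label changed mid-hour. The result shape, field names, and values are assumptions made for this example. They are not the actual query output, and this is not a confirmed root cause of the duplicates.

```python
from collections import defaultdict

# Hypothetical query results: one dict per returned series.
results = [
    {"instance": "i-1", "hour": "10:00", "sla": "Premium", "value": 1.0},
    {"instance": "i-1", "hour": "10:00", "sla": "Standard", "value": 1.0},
    {"instance": "i-2", "hour": "10:00", "sla": "Premium", "value": 2.0},
]


def find_conflicts(series):
    """Return the (instance, hour) keys that have more than one series."""
    grouped = defaultdict(list)
    for entry in series:
        grouped[(entry["instance"], entry["hour"])].append(entry)
    return {key: rows for key, rows in grouped.items() if len(rows) > 1}


conflicts = find_conflicts(results)
for (instance, hour), rows in conflicts.items():
    print(instance, hour, [row["sla"] for row in rows])
```

      A check like this does not decide which value is correct. It only surfaces the hours where naively summing the returned series could count the same instance twice.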

      We need to collaborate with the RHOBS team to figure out how much is feasible and appropriate to offload to PromQL and recording rules, and then identify what changes need to be made on the swatch side while keeping it backwards compatible with fetching metrics from OpenShift telemeter.

          People

            karshah@redhat.com Kartik Shah
            lburnett0 Lindsey Burnett
            Nikhil Kathole Nikhil Kathole