OCPBUGS-8819

Thanos has incorrect increase() and rate() across counter metric reset when queried in production


    • Type: Bug
    • Resolution: Unresolved
    • Priority: Undefined
    • Affects Version: 4.7
    • Component: Telemeter
    • Severity: Moderate

      The increase() and rate() functions gave different results in a live Thanos aggregation than they did in a cluster that was writing directly to Thanos:

      The metric in Telemeter is a new test counter being used to verify whether we can efficiently summarize on-cluster metrics before sending them for accurate counting on a remote cluster:

      cluster:usage:workload:capacity_physical_cpu_core_seconds

      Samples from infogw-proxy @ 2020-11-19 12:00

      7835520 @1605781178
      7898160 @1605781448
      1356720 @1605781590
      1419360 @1605781871
      1453200 @1605782141
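
      For reference, a Prometheus-style engine is expected to treat the drop between the second and third samples as a counter reset to zero and carry the pre-reset value forward when computing increase(). The sketch below is a simplified illustration of that adjustment applied to the samples above, not the actual Prometheus or Thanos implementation (which also extrapolates to the window boundaries):

      package main

      import "fmt"

      // sample is a (timestamp, value) pair from a counter series.
      type sample struct {
          ts    int64
          value float64
      }

      // resetAdjustedIncrease mimics how Prometheus-style engines handle
      // counter resets inside increase(): whenever a sample is lower than
      // the previous one, the counter is assumed to have reset to zero and
      // the pre-reset value is added to a running correction.
      // Simplified sketch only; no boundary extrapolation.
      func resetAdjustedIncrease(samples []sample) float64 {
          if len(samples) < 2 {
              return 0
          }
          correction := 0.0
          prev := samples[0].value
          for _, s := range samples[1:] {
              if s.value < prev {
                  // Counter reset: carry the pre-reset value forward.
                  correction += prev
              }
              prev = s.value
          }
          return samples[len(samples)-1].value - samples[0].value + correction
      }

      func main() {
          // Samples from infogw-proxy @ 2020-11-19 12:00 (see above).
          samples := []sample{
              {1605781178, 7835520},
              {1605781448, 7898160},
              {1605781590, 1356720}, // apparent counter reset
              {1605781871, 1419360},
              {1605782141, 1453200},
          }
          fmt.Printf("reset-adjusted increase over the window: %.0f\n",
              resetAdjustedIncrease(samples))
      }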

      The increase() function appears to handle the counter reset incorrectly (Prometheus handles the same reset correctly), from the same time @ 2020-11-19 12:00:

      (max by (_id) (increase(cluster:usage:workload:capacity_physical_cpu_core_seconds[1h])) / (3600))[2h:5m]
      {_id="5658742f-ac1f-41ca-9fd5-a836ddadddf5"} 257.1005552128316 @1605780000
      255.10178901912403 @1605780300
      253.25107958050586 @1605780600
      250.58605798889576 @1605780900
      248.40546697038727 @1605781200
      245.74031890660595 @1605781500
      635.6028368794325 @1605781800

      ^ expected to be 244 or similar

      665.4719999999999 @1605782100
      654.8736 @1605782400
      619.4050073637702 @1605782700
      617.96110783736 @1605783000
      616.4761343547436 @1605783300
      615.2032999410725 @1605783600
      613.2 @1605783900
      612.4941176470587 @1605784200
      611.0823529411764 @1605784500
      643.0680728667305 @1605784800

      • values start being correct again below

      222.6097635861222 @1605785100
      227.23912425362525 @1605785400
      237.33864088711968 @1605785700
      239.79528006823998 @1605786000
      243.09487258213076 @1605786300
      244.48804414469652 @1605786600
      245.36929206251915 @1605786900
      246.48449493398832 @1605787200

      On the underlying cluster, which uses direct write, the values from the recording rule generating the samples look correct, although there is a longer gap at the moment the discontinuity appears (which may indicate something changed?):

      7905120 @1605781457.35
      7912080 @1605781487.35
      7918800 @1605781517.35
      7925520 @1605781547.35
      7932240 @1605781577.35

      ... // why the gap?

      7939200 @1605781727.35
      7946160 @1605781757.35
      7953120 @1605781787.35

      The on-cluster increase() looked correct, and other resets also looked correct.
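
      One way to pin down where the divergence happens is to evaluate the same expression at the suspect timestamp against both the Thanos query layer and the on-cluster Prometheus via the standard /api/v1/query instant-query API and compare the results. A minimal sketch follows; the endpoint URLs are placeholders and authentication/TLS handling is omitted:

      package main

      import (
          "fmt"
          "io"
          "net/http"
          "net/url"
      )

      // queryAt runs an instant PromQL query against a Prometheus-compatible
      // endpoint at a fixed evaluation time using /api/v1/query.
      func queryAt(base, query string, ts int64) (string, error) {
          u := fmt.Sprintf("%s/api/v1/query?query=%s&time=%d",
              base, url.QueryEscape(query), ts)
          resp, err := http.Get(u)
          if err != nil {
              return "", err
          }
          defer resp.Body.Close()
          body, err := io.ReadAll(resp.Body)
          return string(body), err
      }

      func main() {
          // Placeholder endpoints; substitute the real Thanos Query and
          // on-cluster Prometheus routes.
          endpoints := []string{
              "http://thanos-query.example.com",
              "http://prometheus.example.com",
          }
          const q = "increase(cluster:usage:workload:capacity_physical_cpu_core_seconds[1h]) / 3600"
          const evalTime = 1605781800 // the evaluation where the jump to ~635 appears

          for _, ep := range endpoints {
              out, err := queryAt(ep, q, evalTime)
              if err != nil {
                  fmt.Println(ep, "error:", err)
                  continue
              }
              fmt.Println(ep, "->", out)
          }
      }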

      Setting this to high because it may be a bug in Thanos and could block other calculations; it also blocks understanding whether we can rely on counters in Thanos for billing purposes.
