- Bug
- Resolution: Unresolved
- Undefined
- None
- 4.7
- Moderate
- None
- Rejected
- Unspecified
The increase() and rate() functions gave different results in a live Thanos aggregation than they did on a cluster that was writing directly to Thanos.
The metric in telemeter is a new test counter, used to verify whether we can efficiently summarize on-cluster metrics before sending them, for accurate counting on a remote cluster:
cluster:usage:workload:capacity_physical_cpu_core_seconds
Samples from infogw-proxy @ 2020-11-19 12:00
7835520 @1605781178
7898160 @1605781448
1356720 @1605781590
1419360 @1605781871
1453200 @1605782141
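The drop between the second and third sample (7898160 to 1356720) is the point a reset-aware function has to compensate for. As a minimal sketch of Prometheus-style reset handling, assuming that any decrease is treated as a counter reset and leaving out the extrapolation to window boundaries that increase() also applies, the raw increase over these five samples can be computed as follows. The helper name `raw_increase` is hypothetical and is not part of Prometheus or Thanos.

```python
# Samples copied from the report above: (value, unix_timestamp).
samples = [
    (7835520, 1605781178),
    (7898160, 1605781448),
    (1356720, 1605781590),
    (1419360, 1605781871),
    (1453200, 1605782141),
]


def raw_increase(samples):
    """Increase across the samples with reset compensation.

    Assumption: a decrease between neighbouring samples means the counter
    restarted from zero, so the pre-reset value is added back.
    Extrapolation to the window boundaries is deliberately omitted.
    """
    values = [value for value, _ in samples]
    total = values[-1] - values[0]
    for prev, cur in zip(values, values[1:]):
        if cur < prev:  # treated as a counter reset
            total += prev
    return total


print(raw_increase(samples))  # 1515840
```

Under that assumption the drop contributes the full post-drop value, 1356720, as new increase, on top of the ordinary deltas between the other samples.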
The increase() function appears to handle the counter reset incorrectly (Prometheus would handle it correctly). The query below was evaluated at the same time, 2020-11-19 12:00:
(max by (_id) (increase(cluster:usage:workload:capacity_physical_cpu_core_seconds[1h])) / (3600))[2h:5m]
{_id="5658742f-ac1f-41ca-9fd5-a836ddadddf5"} 257.1005552128316 @1605780000
255.10178901912403 @1605780300
253.25107958050586 @1605780600
250.58605798889576 @1605780900
248.40546697038727 @1605781200
245.74031890660595 @1605781500
635.6028368794325 @1605781800
^ expected to be 244 or similar
665.4719999999999 @1605782100
654.8736 @1605782400
619.4050073637702 @1605782700
617.96110783736 @1605783000
616.4761343547436 @1605783300
615.2032999410725 @1605783600
613.2 @1605783900
612.4941176470587 @1605784200
611.0823529411764 @1605784500
643.0680728667305 @1605784800
- starts being correct below
222.6097635861222 @1605785100
227.23912425362525 @1605785400
237.33864088711968 @1605785700
239.79528006823998 @1605786000
243.09487258213076 @1605786300
244.48804414469652 @1605786600
245.36929206251915 @1605786900
246.48449493398832 @1605787200
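One way to read the listing above is to check which evaluation steps have a 1h window that contains both samples around the drop (7898160 @1605781448 and 1356720 @1605781590). The sketch below assumes the subquery steps are the 5m-aligned timestamps shown in the output and that the window for step t covers t-3600 to t. The function name `affected_steps` is hypothetical.

```python
WINDOW = 3600  # the [1h] range in the query, in seconds
STEP = 300     # the 5m subquery resolution, in seconds

# Timestamps of the two telemeter samples on either side of the drop.
DROP_PREV_TS = 1605781448
DROP_NEXT_TS = 1605781590


def affected_steps(start, end):
    """Evaluation steps whose window contains both samples around the drop."""
    return [
        t
        for t in range(start, end + 1, STEP)
        if t - WINDOW <= DROP_PREV_TS and DROP_NEXT_TS <= t
    ]


steps = affected_steps(1605780000, 1605787200)
print(steps[0], steps[-1], len(steps))  # 1605781800 1605784800 11
```

The eleven steps from 1605781800 through 1605784800 are the same rows that show the inflated values. The first row marked as correct again, 1605785100, is the first step whose window no longer reaches back to the sample at 1605781448.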
On the underlying cluster, which uses direct write, the values from the recording rule that generates the samples look correct, although there is a longer gap at the moment the discontinuity appears (which may indicate that something changed):
7905120 @1605781457.35
7912080 @1605781487.35
7918800 @1605781517.35
7925520 @1605781547.35
7932240 @1605781577.35
... // why the gap?
7939200 @1605781727.35
7946160 @1605781757.35
7953120 @1605781787.35
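The size of that gap can be checked directly from the timestamps. This is a small sketch that flags any interval noticeably longer than the typical spacing. The 1.5x threshold and the name `find_gaps` are arbitrary choices for illustration.

```python
from statistics import median

# On-cluster samples copied from the listing above: (value, unix_timestamp).
cluster_samples = [
    (7905120, 1605781457.35),
    (7912080, 1605781487.35),
    (7918800, 1605781517.35),
    (7925520, 1605781547.35),
    (7932240, 1605781577.35),
    (7939200, 1605781727.35),
    (7946160, 1605781757.35),
    (7953120, 1605781787.35),
]


def find_gaps(samples, factor=1.5):
    """Return (start_ts, end_ts, seconds) for intervals longer than
    factor times the median spacing between neighbouring samples."""
    timestamps = [ts for _, ts in samples]
    # Round to hide floating-point noise in the fractional timestamps.
    deltas = [round(b - a, 2) for a, b in zip(timestamps, timestamps[1:])]
    typical = median(deltas)
    return [
        (a, b, d)
        for a, b, d in zip(timestamps, timestamps[1:], deltas)
        if d > factor * typical
    ]


gaps = find_gaps(cluster_samples)
for start, end, seconds in gaps:
    print(f"gap of {seconds:.0f}s between {start} and {end}")
```

For the listing above this reports a single 150 s gap against a typical spacing of 30 s. The counter advanced by 6960 across that gap, the same amount it advances over a single 30 s interval elsewhere in the listing.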
The on-cluster increase() looked correct, and other resets also looked correct.
Setting to high because this may be a bug in Thanos that blocks other calculations, and because it also blocks our understanding of whether we can rely on counters in Thanos for billing purposes.