OCPBUGS-8819

Thanos has incorrect increase() and rate() across counter metric reset when queried in production


    • Type: Bug
    • Resolution: Unresolved
    • Priority: Undefined
    • Affects Version: 4.7
    • Component: Telemeter
    • Severity: Moderate

      The increase() and rate() functions gave different results in a live Thanos aggregation than they did in a cluster that was writing directly to Thanos:

      The metric in Telemeter is a new test counter being used to verify whether we can efficiently summarize on-cluster metrics before sending them for accurate counting on a remote cluster:

      cluster:usage:workload:capacity_physical_cpu_core_seconds

      Samples from infogw-proxy @ 2020-11-19 12:00

      7835520 @1605781178
      7898160 @1605781448
      1356720 @1605781590
      1419360 @1605781871
      1453200 @1605782141
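
      For reference, a Prometheus-style engine is expected to treat the drop between the second and third samples as a counter reset to zero and carry the pre-reset value forward when computing increase(). The sketch below is a simplified illustration of that adjustment applied to the samples above, not the actual Prometheus or Thanos implementation (which also extrapolates to the window boundaries):

      package main

      import "fmt"

      // sample is a (timestamp, value) pair from a counter series.
      type sample struct {
          ts    int64
          value float64
      }

      // resetAdjustedIncrease mimics how Prometheus-style engines handle
      // counter resets inside increase(): whenever a sample is lower than
      // the previous one, the counter is assumed to have reset to zero and
      // the pre-reset value is added to a running correction.
      // Simplified sketch only; no boundary extrapolation.
      func resetAdjustedIncrease(samples []sample) float64 {
          if len(samples) < 2 {
              return 0
          }
          correction := 0.0
          prev := samples[0].value
          for _, s := range samples[1:] {
              if s.value < prev {
                  // Counter reset: carry the pre-reset value forward.
                  correction += prev
              }
              prev = s.value
          }
          return samples[len(samples)-1].value - samples[0].value + correction
      }

      func main() {
          // Samples from infogw-proxy @ 2020-11-19 12:00 (see above).
          samples := []sample{
              {1605781178, 7835520},
              {1605781448, 7898160},
              {1605781590, 1356720}, // apparent counter reset
              {1605781871, 1419360},
              {1605782141, 1453200},
          }
          fmt.Printf("reset-adjusted increase over the window: %.0f\n",
              resetAdjustedIncrease(samples))
      }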

      The increase() function appears to handle the counter reset incorrectly (Prometheus handles the same reset correctly), from the same time @ 2020-11-19 12:00:

      (max by (_id) (increase(cluster:usage:workload:capacity_physical_cpu_core_seconds[1h])) / (3600))[2h:5m]
      {_id="5658742f-ac1f-41ca-9fd5-a836ddadddf5"} 257.1005552128316 @1605780000
      255.10178901912403 @1605780300
      253.25107958050586 @1605780600
      250.58605798889576 @1605780900
      248.40546697038727 @1605781200
      245.74031890660595 @1605781500
      635.6028368794325 @1605781800

      ^ expected to be 244 or similar

      665.4719999999999 @1605782100
      654.8736 @1605782400
      619.4050073637702 @1605782700
      617.96110783736 @1605783000
      616.4761343547436 @1605783300
      615.2032999410725 @1605783600
      613.2 @1605783900
      612.4941176470587 @1605784200
      611.0823529411764 @1605784500
      643.0680728667305 @1605784800

      • values start being correct again below

      222.6097635861222 @1605785100
      227.23912425362525 @1605785400
      237.33864088711968 @1605785700
      239.79528006823998 @1605786000
      243.09487258213076 @1605786300
      244.48804414469652 @1605786600
      245.36929206251915 @1605786900
      246.48449493398832 @1605787200

      On the underlying cluster, which uses direct write, the values from the recording rule generating the samples look correct, although there is a longer gap at the moment the discontinuity appears (which may indicate something changed?):

      7905120 @1605781457.35
      7912080 @1605781487.35
      7918800 @1605781517.35
      7925520 @1605781547.35
      7932240 @1605781577.35

      ... // why the gap?

      7939200 @1605781727.35
      7946160 @1605781757.35
      7953120 @1605781787.35

      The on-cluster increase() looked correct, and other resets also looked correct.
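
      One way to pin down where the divergence happens is to evaluate the same expression at the suspect timestamp against both the Thanos query layer and the on-cluster Prometheus via the standard /api/v1/query instant-query API and compare the results. A minimal sketch follows; the endpoint URLs are placeholders and authentication/TLS handling is omitted:

      package main

      import (
          "fmt"
          "io"
          "net/http"
          "net/url"
      )

      // queryAt runs an instant PromQL query against a Prometheus-compatible
      // endpoint at a fixed evaluation time using /api/v1/query.
      func queryAt(base, query string, ts int64) (string, error) {
          u := fmt.Sprintf("%s/api/v1/query?query=%s&time=%d",
              base, url.QueryEscape(query), ts)
          resp, err := http.Get(u)
          if err != nil {
              return "", err
          }
          defer resp.Body.Close()
          body, err := io.ReadAll(resp.Body)
          return string(body), err
      }

      func main() {
          // Placeholder endpoints; substitute the real Thanos Query and
          // on-cluster Prometheus routes.
          endpoints := []string{
              "http://thanos-query.example.com",
              "http://prometheus.example.com",
          }
          const q = "increase(cluster:usage:workload:capacity_physical_cpu_core_seconds[1h]) / 3600"
          const evalTime = 1605781800 // the evaluation where the jump to ~635 appears

          for _, ep := range endpoints {
              out, err := queryAt(ep, q, evalTime)
              if err != nil {
                  fmt.Println(ep, "error:", err)
                  continue
              }
              fmt.Println(ep, "->", out)
          }
      }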

      Setting this to high because it may be a bug in Thanos and could block other calculations; it also blocks understanding whether we can rely on counters in Thanos for billing purposes.
