Job metrics for etcd health

Type: Epic
Status: In Progress
Resolution: Unresolved
Progress: 0% To Do, 0% In Progress, 100% Done
Epic Goal
Add new job-level metrics that help identify actual etcd regressions in our CI and differentiate them from platform-related issues such as slow or overloaded disks and networks.
Why is this important?
From the original bug, OCPBUGS-55755:
TRT has a framework whereby origin can run a Prometheus query at the end of the run for you and include the results in a datafile. That datafile is automatically ingested into a BigQuery table and then becomes chartable. In the near future we hope that this could be monitored automatically for regressions in a manner that feeds into component readiness. This happens today in https://github.com/openshift/release/blob/master/ci-operator/step-registry/gather/extra/gather-extra-commands.sh#L361
WARNING: the framework is minimally used today and may need improvements as we get into things. This will require working closely with TRT.
The catch is that we can only chart a single value per job run; we cannot work with full time-series data as you would graph it in Prometheus, so each query must boil down to a useful single value.
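To make the single-value constraint concrete, here is a minimal sketch of how one such per-run scalar might be computed: an instant PromQL query evaluated against the cluster's Prometheus at the end of the run, whose result collapses to one number that can be written to the datafile. The Prometheus address, the query window, and the use of the Go client library are assumptions for illustration only; the actual CI step drives Prometheus from the gather-extra shell script linked above.

```go
package main

import (
	"context"
	"fmt"
	"time"

	"github.com/prometheus/client_golang/api"
	v1 "github.com/prometheus/client_golang/api/prometheus/v1"
)

func main() {
	// Assumed in-cluster Prometheus endpoint; the real CI step reaches
	// Prometheus through the gather-extra script, not this client.
	client, err := api.NewClient(api.Config{
		Address: "https://prometheus-k8s.openshift-monitoring.svc:9091",
	})
	if err != nil {
		panic(err)
	}
	promAPI := v1.NewAPI(client)

	// One scalar per job run: p99 etcd WAL fsync latency over an assumed
	// one-hour run window.
	query := `histogram_quantile(0.99, sum(rate(etcd_disk_wal_fsync_duration_seconds_bucket[1h])) by (le))`

	ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
	defer cancel()

	result, _, err := promAPI.Query(ctx, query, time.Now())
	if err != nil {
		panic(err)
	}
	// The instant query collapses to a single value that could be written to
	// the datafile TRT ingests into BigQuery.
	fmt.Println(result)
}
```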
This list needs improving and expanding, but some initial ideas (illustrative PromQL sketches follow the list):
- various percentiles for etcd disk fsync
- percentages of requests hitting various HTTP status codes
- request duration percentiles
- something capturing counts/percentages of request timeouts
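As a hedged sketch, the candidate queries below show what PromQL expressions for those ideas could look like. Every metric name, label selector, window, and the timeouts-as-504s assumption is illustrative and would need to be validated against the CI cluster's Prometheus before being wired into the framework.

```go
package candidates

// Illustrative per-job-run scalar queries; names and selectors are assumptions,
// not the final set of metrics for this epic.
var candidateQueries = map[string]string{
	// Percentiles for etcd disk fsync and backend commit latency.
	"etcd_wal_fsync_p99":      `histogram_quantile(0.99, sum(rate(etcd_disk_wal_fsync_duration_seconds_bucket[1h])) by (le))`,
	"etcd_backend_commit_p99": `histogram_quantile(0.99, sum(rate(etcd_disk_backend_commit_duration_seconds_bucket[1h])) by (le))`,

	// Percentage of apiserver requests answered with a 5xx status code.
	"apiserver_5xx_percent": `100 * sum(rate(apiserver_request_total{code=~"5.."}[1h])) / sum(rate(apiserver_request_total[1h]))`,

	// Request duration percentile, excluding long-lived watches.
	"apiserver_request_duration_p99": `histogram_quantile(0.99, sum(rate(apiserver_request_duration_seconds_bucket{verb!="WATCH"}[1h])) by (le))`,

	// Rough proxy for request timeouts, assuming they surface as 504s.
	"apiserver_timeout_percent": `100 * sum(rate(apiserver_request_total{code="504"}[1h])) / sum(rate(apiserver_request_total[1h]))`,
}
```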
Acceptance Criteria
New scalar metrics collected per job run are visible at https://grafana-loki.ci.openshift.org/d/41AejWpSz/metrics-investigation?orgId=1
No docs or QE are required, as this is not a user-facing feature and does not involve any API changes.