Job metrics for etcd health

Type: Epic
Status: In Progress
Resolution: Unresolved
Progress: 0% To Do, 0% In Progress, 100% Done
Epic Goal
Add new job-level metrics that help identify actual etcd regressions in our CI and differentiate them from platform-related issues such as slow or overloaded disks and networks.
Why is this important?
From the original bug, OCPBUGS-55755:
TRT has a framework whereby origin can run a Prometheus query at the end of the run for you and include the results in a datafile. That datafile is automatically ingested into a BigQuery table and then becomes chartable. In the near future we hope that this could be monitored automatically for regressions in a manner that feeds into component readiness. This happens today in https://github.com/openshift/release/blob/master/ci-operator/step-registry/gather/extra/gather-extra-commands.sh#L361
WARNING: the framework is minimally used today and may need improvements as we get into things. This will require working closely with TRT.
The catch is that we can only chart a single value per job run; we cannot work with full time-series data as you would graph it in Prometheus, so each query must boil down to a useful single value.
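To make the single-value constraint concrete, here is a minimal sketch of how one such per-run scalar might be computed: an instant PromQL query evaluated against the cluster's Prometheus at the end of the run, whose result collapses to one number that can be written to the datafile. The Prometheus address, the query window, and the use of the Go client library are assumptions for illustration only; the actual CI step drives Prometheus from the gather-extra shell script linked above.

```go
package main

import (
	"context"
	"fmt"
	"time"

	"github.com/prometheus/client_golang/api"
	v1 "github.com/prometheus/client_golang/api/prometheus/v1"
)

func main() {
	// Assumed in-cluster Prometheus endpoint; the real CI step reaches
	// Prometheus through the gather-extra script, not this client.
	client, err := api.NewClient(api.Config{
		Address: "https://prometheus-k8s.openshift-monitoring.svc:9091",
	})
	if err != nil {
		panic(err)
	}
	promAPI := v1.NewAPI(client)

	// One scalar per job run: p99 etcd WAL fsync latency over an assumed
	// one-hour run window.
	query := `histogram_quantile(0.99, sum(rate(etcd_disk_wal_fsync_duration_seconds_bucket[1h])) by (le))`

	ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
	defer cancel()

	result, _, err := promAPI.Query(ctx, query, time.Now())
	if err != nil {
		panic(err)
	}
	// The instant query collapses to a single value that could be written to
	// the datafile TRT ingests into BigQuery.
	fmt.Println(result)
}
```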
This list needs improving and expanding, but some initial ideas (illustrative PromQL sketches follow the list):
- various percentiles for etcd disk fsync
- percentages of requests hitting various HTTP status codes
- request duration percentiles
- something capturing counts/percentages of request timeouts
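As a hedged sketch, the candidate queries below show what PromQL expressions for those ideas could look like. Every metric name, label selector, window, and the timeouts-as-504s assumption is illustrative and would need to be validated against the CI cluster's Prometheus before being wired into the framework.

```go
package candidates

// Illustrative per-job-run scalar queries; names and selectors are assumptions,
// not the final set of metrics for this epic.
var candidateQueries = map[string]string{
	// Percentiles for etcd disk fsync and backend commit latency.
	"etcd_wal_fsync_p99":      `histogram_quantile(0.99, sum(rate(etcd_disk_wal_fsync_duration_seconds_bucket[1h])) by (le))`,
	"etcd_backend_commit_p99": `histogram_quantile(0.99, sum(rate(etcd_disk_backend_commit_duration_seconds_bucket[1h])) by (le))`,

	// Percentage of apiserver requests answered with a 5xx status code.
	"apiserver_5xx_percent": `100 * sum(rate(apiserver_request_total{code=~"5.."}[1h])) / sum(rate(apiserver_request_total[1h]))`,

	// Request duration percentile, excluding long-lived watches.
	"apiserver_request_duration_p99": `histogram_quantile(0.99, sum(rate(apiserver_request_duration_seconds_bucket{verb!="WATCH"}[1h])) by (le))`,

	// Rough proxy for request timeouts, assuming they surface as 504s.
	"apiserver_timeout_percent": `100 * sum(rate(apiserver_request_total{code="504"}[1h])) / sum(rate(apiserver_request_total[1h]))`,
}
```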
Acceptance Criteria
New scalar metrics collected per job run are visible at https://grafana-loki.ci.openshift.org/d/41AejWpSz/metrics-investigation?orgId=1
No docs or QE are required, as this is not a user-facing feature and does not involve any API changes.