- Bug
- Resolution: Done
- Major
- 4.20.0
- Quality / Stability / Reliability
- Important
- Approved
As part of the fallout from OCPBUGS-55445 and OCPBUGS-54222, it is clear we need a better means of monitoring the health of the apiserver and etcd. We decided to allow the "regression" for those two bugs, but we need to close this gap to eliminate the ambiguity we saw in that run.
TRT has a framework whereby origin can run a Prometheus query for you at the end of the run and include the results in a data file. That data file is automatically ingested into a BigQuery table and then becomes chartable. In the near future we hope this can be monitored automatically for regressions in a manner that feeds into Component Readiness. This happens today in https://github.com/openshift/release/blob/master/ci-operator/step-registry/gather/extra/gather-extra-commands.sh#L361
WARNING: the framework is minimally used today and may need improvements as we get into things. This will require working closely with TRT.
The catch is that we can only chart a single value per job run. We cannot work with full time series data as you would graph in Prometheus, so each query must boil down to one useful value.
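To make the single-value constraint concrete, here is a minimal Python sketch that collapses a Prometheus range-query response into one number. It assumes the standard Prometheus HTTP API "matrix" response shape; the function name `reduce_series`, the sample data, and the choice of reductions are illustrative only and are not part of the TRT framework.

```python
import math


def reduce_series(response: dict, how: str = "max") -> float:
    """Collapse a Prometheus range-query ("matrix") response into one number.

    Hypothetical helper for illustration; not part of origin or the TRT tooling.
    """
    samples = []
    for series in response["data"]["result"]:
        for _timestamp, raw in series["values"]:
            value = float(raw)  # Prometheus encodes sample values as strings
            if not math.isnan(value):
                samples.append(value)
    if not samples:
        raise ValueError("no usable samples in response")
    if how == "max":
        return max(samples)
    if how == "mean":
        return sum(samples) / len(samples)
    raise ValueError(f"unknown reduction: {how}")


# Made-up sample data in the shape returned by /api/v1/query_range.
example = {
    "status": "success",
    "data": {
        "resultType": "matrix",
        "result": [
            {"metric": {"pod": "etcd-0"},
             "values": [[1700000000, "0.004"], [1700000060, "0.012"]]},
            {"metric": {"pod": "etcd-1"},
             "values": [[1700000000, "0.006"], [1700000060, "NaN"]]},
        ],
    },
}

print(reduce_series(example, "max"))
```

In practice the reduction can often be pushed into the query itself, for example with a PromQL `*_over_time` function, so that the query returns a single sample directly.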
This list needs improving and expanding, but some initial ideas:
- various percentiles for etcd disk fsync
- percentages of requests hitting various HTTP status codes
- request duration percentiles
- something capturing counts/percentages of request timeouts
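As a starting point for discussion, the ideas above could be expressed as PromQL queries that each return one value per job run. The sketch below builds such queries in Python. Everything in it is an assumption to be validated with TRT: the result names, the 2h window, the choice of the 99th percentile, the use of the WAL fsync histogram for "disk fsync", and the use of HTTP 504 as a stand-in for request timeouts. The metric names (`etcd_disk_wal_fsync_duration_seconds_bucket`, `apiserver_request_total`, `apiserver_request_duration_seconds_bucket`) are the usual etcd and kube-apiserver metrics, but their availability in a given job should be confirmed.

```python
def single_value_queries(window: str = "2h") -> dict:
    """Return candidate PromQL queries that each reduce to a single value.

    Hypothetical helper; names, window, and percentile are illustrative.
    """
    step = "1m"
    # p99 WAL fsync latency as a time series, then its worst value over the run.
    fsync_p99 = (
        "histogram_quantile(0.99, sum by (le) "
        "(rate(etcd_disk_wal_fsync_duration_seconds_bucket[5m])))"
    )
    return {
        "etcd_wal_fsync_p99_max_seconds":
            f"max_over_time(({fsync_p99})[{window}:{step}])",
        # Share of all apiserver requests over the window that returned 5xx.
        "apiserver_5xx_percent":
            f'100 * sum(increase(apiserver_request_total{{code=~"5.."}}[{window}]))'
            f" / sum(increase(apiserver_request_total[{window}]))",
        # p99 request duration over the whole window, excluding long-lived watches.
        "apiserver_request_duration_p99_seconds":
            "histogram_quantile(0.99, sum by (le) "
            f'(increase(apiserver_request_duration_seconds_bucket{{verb!="WATCH"}}[{window}])))',
        # Assumption: 504 responses approximate request timeouts.
        "apiserver_timeout_504_count":
            f'sum(increase(apiserver_request_total{{code="504"}}[{window}]))',
    }


for name, query in single_value_queries().items():
    print(f"{name}: {query}")
```

Each query aggregates away all labels or reduces over time, so the result is one sample that can be written to the data file as a single number.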
After discussion with Nick and David, the request here is to have this gap closed in 4.20. I have marked this as a 4.20 release blocker.
- is related to
  - TRT-2031 Enhance Disruption Associated Event & Resource Tracking Support (New)
  - OCPBUGS-54222 etcdGRPCRequestsSlow alert test regressed (Closed)
  - OCPBUGS-55445 etcdHighCommitDurations alert test regressed (Closed)
- relates to
  - OCPBUGS-52968 Component Readiness: [Etcd] [Other] test regressed (excessive took too long messages) (Closed)