-
Epic
-
Resolution: Unresolved
-
Major
-
Observability: measure LB readiness
-
BU Product Work
-
To Do
-
OCPSTRAT-175 - Observability Improvements of API server (part 1)
-
100% To Do, 0% In Progress, 0% Done
Scope:
- CI only (in the future we can discuss how this can be extended to customer clusters)
- external LB, internal LB, and service network LB
Acceptance Criteria:
- we should exercise both HTTP/1.x and HTTP/2 (the current disruption test has never exercised HTTP/2)
- we should exercise requests over existing TCP connections and over new TCP connections
- measure how long each LB takes to react to the readiness change each time an apiserver goes down
- TRT dashboard to show data over multiple platforms
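The protocol and connection criteria above give four probe combinations. A minimal sketch of how they could be built with Go's standard net/http transport follows; the probeVariant type and the function names are hypothetical and are not taken from the existing disruption test.

```go
package main

import (
	"crypto/tls"
	"fmt"
	"net/http"
	"time"
)

// probeVariant is a hypothetical name for one protocol/connection combination.
type probeVariant struct {
	name            string
	http2           bool // true: negotiate HTTP/2, false: stay on HTTP/1.1
	reuseConnection bool // true: keep-alive, false: new TCP connection per request
}

// probeVariants lists the four combinations named in the acceptance criteria.
func probeVariants() []probeVariant {
	return []probeVariant{
		{name: "http1-new-connections", http2: false, reuseConnection: false},
		{name: "http1-reused-connections", http2: false, reuseConnection: true},
		{name: "http2-new-connections", http2: true, reuseConnection: false},
		{name: "http2-reused-connections", http2: true, reuseConnection: true},
	}
}

// newProbeTransport configures a transport for one variant.
func newProbeTransport(v probeVariant) *http.Transport {
	tr := &http.Transport{
		// Disabling keep-alives makes every request open a new connection.
		DisableKeepAlives:   !v.reuseConnection,
		TLSHandshakeTimeout: 10 * time.Second,
	}
	if v.http2 {
		tr.ForceAttemptHTTP2 = true
	} else {
		// A non-nil, empty TLSNextProto map disables HTTP/2 on this transport.
		tr.TLSNextProto = map[string]func(string, *tls.Conn) http.RoundTripper{}
	}
	return tr
}

func main() {
	for _, v := range probeVariants() {
		tr := newProbeTransport(v)
		fmt.Printf("%s: forceHTTP2=%t disableKeepAlives=%t\n",
			v.name, tr.ForceAttemptHTTP2, tr.DisableKeepAlives)
	}
}
```

Each transport would be wrapped in its own http.Client and pointed at one LB endpoint, so disruption can be attributed to a specific protocol and connection mode.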
Detect, in CI, how long it takes the external, internal, and service LBs to honor readiness going to false for the kube-apiserver.
(ideas)
- the apiserver can write shutdown-related data to a response header for requests that ask for it (perhaps via a conditional request header). The disruption test can then check the response header and determine whether readiness is being respected. (This does not require any post-job audit log parsing.)
- add a metric that starts recording as soon as the apiserver starts the shutdown process
- build an alert based on the metric, if appropriate
- add an annotation to audit entries for incoming requests that arrive immediately after the apiserver initiates the shutdown process; the annotation should record the time elapsed between the start of shutdown and the arrival of the request
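The response-header idea can be sketched as a plain net/http handler wrapper. This is an illustration only: both header names, the shutdownTracker type, and the simulate helper are hypothetical, and the real apiserver filter chain may look quite different.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"sync"
	"time"
)

// Both header names are hypothetical, chosen for this sketch only.
const (
	askHeader   = "X-Probe-Shutdown-Info"
	replyHeader = "X-Shutdown-Elapsed"
)

// shutdownTracker remembers when shutdown began; the clock is injectable.
type shutdownTracker struct {
	mu      sync.Mutex
	started time.Time // zero until shutdown begins
	now     func() time.Time
}

// begin records the start of shutdown once.
func (s *shutdownTracker) begin() {
	s.mu.Lock()
	defer s.mu.Unlock()
	if s.started.IsZero() {
		s.started = s.now()
	}
}

// elapsed returns the time since shutdown began, or false if it has not begun.
func (s *shutdownTracker) elapsed() (time.Duration, bool) {
	s.mu.Lock()
	defer s.mu.Unlock()
	if s.started.IsZero() {
		return 0, false
	}
	return s.now().Sub(s.started), true
}

// wrap adds the reply header only for requests that ask for it during shutdown.
func (s *shutdownTracker) wrap(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if r.Header.Get(askHeader) != "" {
			if d, ok := s.elapsed(); ok {
				w.Header().Set(replyHeader, d.String())
			}
		}
		next.ServeHTTP(w, r)
	})
}

// simulate sends one request, 3s after shutdown when shuttingDown is true,
// and returns the value of the reply header.
func simulate(shuttingDown, ask bool) string {
	clock := time.Unix(0, 0)
	tracker := &shutdownTracker{now: func() time.Time { return clock }}
	if shuttingDown {
		tracker.begin()
		clock = clock.Add(3 * time.Second)
	}
	h := tracker.wrap(http.HandlerFunc(func(w http.ResponseWriter, _ *http.Request) {
		w.WriteHeader(http.StatusOK)
	}))
	req := httptest.NewRequest(http.MethodGet, "/readyz", nil)
	if ask {
		req.Header.Set(askHeader, "true")
	}
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, req)
	return rec.Result().Header.Get(replyHeader)
}

func main() {
	fmt.Printf("during shutdown, asked: %q\n", simulate(true, true))
	fmt.Printf("during shutdown, not asked: %q\n", simulate(true, false))
	fmt.Printf("before shutdown, asked: %q\n", simulate(false, true))
}
```

A disruption probe that sees the reply header on a request routed through an LB knows that the LB still sent traffic to an apiserver that had already begun shutting down, and how late that request was. The same elapsed value could feed the metric or the audit annotation.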
In CI, this is our goal:
- add audit log streaming to openshift/origin
- add audit log summarization for the streaming to gather the "got even after server was ready"
- produce data indicating how long the different LBs took to react to the readiness change for each apiserver down
- produce synthetic test that fails when that duration changes significantly for the worse, with values per platform
- update the operator to tighten the window to a reasonable multiple of the p99
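The last two goals amount to a p99 calculation and a threshold comparison. A minimal sketch follows, assuming reaction times are collected in seconds per platform and using a nearest-rank percentile; the function names, the tolerance, the multiple, and the sample values are all placeholders, not measured data.

```go
package main

import (
	"errors"
	"fmt"
	"sort"
)

// percentile returns the nearest-rank p-th percentile (1 <= p <= 100).
func percentile(samples []float64, p int) (float64, error) {
	if len(samples) == 0 {
		return 0, errors.New("no samples")
	}
	if p < 1 || p > 100 {
		return 0, errors.New("percentile must be between 1 and 100")
	}
	sorted := append([]float64(nil), samples...) // do not mutate the input
	sort.Float64s(sorted)
	rank := (p*len(sorted) + 99) / 100 // integer ceiling of p*n/100
	return sorted[rank-1], nil
}

// regressed reports whether the current p99 exceeds the baseline p99
// by more than the allowed tolerance factor.
func regressed(baselineP99, currentP99, tolerance float64) bool {
	return currentP99 > baselineP99*tolerance
}

// tightenedWindow derives a candidate window as a multiple of the p99.
func tightenedWindow(p99, multiple float64) float64 {
	return p99 * multiple
}

func main() {
	// Placeholder samples in seconds, keyed by platform; not real CI data.
	current := map[string][]float64{
		"platform-a": {4, 5, 5, 6, 7},
		"platform-b": {9, 10, 12, 30, 11},
	}
	baselineP99 := map[string]float64{"platform-a": 6, "platform-b": 12}

	for platform, samples := range current {
		p99, err := percentile(samples, 99)
		if err != nil {
			fmt.Println(platform, "error:", err)
			continue
		}
		fmt.Printf("%s: p99=%.1fs regressed=%t window=%.1fs\n",
			platform, p99,
			regressed(baselineP99[platform], p99, 1.5),
			tightenedWindow(p99, 2))
	}
}
```

Keeping the baseline per platform matters because LB behavior differs across platforms, so a single global threshold would either hide regressions on fast platforms or fail constantly on slow ones.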
Reference:
- https://github.com/openshift/kubernetes/pull/1456
- https://github.com/openshift/origin/pull/27687
- https://github.com/openshift/kubernetes/blob/7dab57f2302ec0f94b648ccb48b0fa67c2befbb3/staging/src/k8s.io/apiserver/pkg/server/patch_genericapiserver.go#L82-L99
- metrics in ci.search: https://github.com/openshift/release/pull/25375
- Dashboards: https://github.com/openshift/cluster-kube-apiserver-operator/pull/1271
- Rules for telemetry: https://github.com/openshift/cluster-kube-apiserver-operator/pull/1272
- incorporates: OCPSTRAT-175 Observability Improvements of API server (part 1) (Backlog)
- is related to: OCPBUGS-13158 Run in-cluster disruption tests (Closed)
1. convert existing disruption tests to the new framework | New | Unassigned
2. investigate API disruption on azure (using new disruption test) | New | Unassigned