Epic Name:
Observability: measure LB readiness
Work Type:
BU Product Work
Blocked:
False
Blocked Reason:
None
Ready:
False
Color Status:
Not Selected
Epic Status:
To Do
Feature Link:
OCPSTRAT-175 - Observability Improvements of API server (part 1)
Parent Link:
OCPSTRAT-175 Observability Improvements of API server (part 1)
Hierarchy Progress Bar:

100% To Do, 0% In Progress, 0% Done

Scope:

CI only (In the future we can discuss how this can be extended to customer cluster)
external LB, internal LB, and service network LB

Acceptance Criteria:

we should exercise both HTTP/1.x and HTTP/2 (the current disruption test has never exercised HTTP/2)
we should exercise requests over existing TCP connections and over new TCP connections
measure how long each LB takes to react to the readiness change each time an apiserver goes down
the TRT dashboard should show the data across multiple platforms
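
The protocol and connection criteria above can be sketched as four client variants. This is a minimal sketch, not the disruption framework's actual code: `probeVariant`, `newProbeClient`, and `probeVariants` are hypothetical names, and TLS trust configuration for the LB endpoints is left out.

```go
package main

import (
	"crypto/tls"
	"net/http"
	"time"
)

// probeVariant (hypothetical name) pairs a label with a client configured
// for one protocol/connection combination.
type probeVariant struct {
	Name   string
	Client *http.Client
}

// newProbeClient builds a client for one combination.
// useHTTP2 selects HTTP/2 vs HTTP/1.x; reuseConn selects whether requests
// may share a kept-alive connection or must dial a new one each time.
func newProbeClient(useHTTP2, reuseConn bool) *http.Client {
	tr := &http.Transport{
		ForceAttemptHTTP2: useHTTP2,
		// Disabling keep-alives asks the transport not to reuse a
		// connection across requests.
		DisableKeepAlives: !reuseConn,
	}
	if !useHTTP2 {
		// A non-nil, empty TLSNextProto map disables HTTP/2 on this transport.
		tr.TLSNextProto = map[string]func(string, *tls.Conn) http.RoundTripper{}
	}
	// Assumption: a 10s timeout is only a placeholder value.
	return &http.Client{Transport: tr, Timeout: 10 * time.Second}
}

// probeVariants returns all four combinations named in the criteria.
func probeVariants() []probeVariant {
	return []probeVariant{
		{"http1-reused-connection", newProbeClient(false, true)},
		{"http1-new-connection", newProbeClient(false, false)},
		{"http2-reused-connection", newProbeClient(true, true)},
		{"http2-new-connection", newProbeClient(true, false)},
	}
}
```

A test could loop over `probeVariants()` and poll each LB endpoint with every client, recording results per variant name.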

detect in CI how long the external, internal, and service LBs take to honor readiness going to false for the kube-apiserver

(ideas)

the apiserver can write shutdown-related data to the response header for requests that ask for it (maybe using a conditional request header). The disruption test can then check the response header and determine whether readiness is being respected. (This does not require any post-job audit log parsing.)
add a metric that starts recording as soon as the apiserver starts the shutdown process
build alert based on the metric, if appropriate
add an annotation to audit entries for requests that arrive after the apiserver initiates the shutdown process; the annotation should record the time elapsed between the start of shutdown and the arrival of the request

In CI, this is our goal:

~~add audit log streaming to openshift/origin~~
~~add audit log summarization for the streaming to gather the "got even after server was ready"~~
produce data indicating how long each LB took to react to the readiness change each time an apiserver goes down
produce a synthetic test that fails when that duration changes significantly for the worse, with values per platform
update the operator to tighten the window to a reasonable multiple of the p99

Reference:

incorporates

OCPSTRAT-175 Observability Improvements of API server (part 1)

is related to

OCPBUGS-13158 Run in-cluster disruption tests

1.	convert existing disruption tests to the new framework		New		Unassigned
2.	investigate API disruption on azure (using new disruption test)		New		Unassigned