  1. OpenShift API Server
  2. API-1526

Observability: external, internal, and service LB should honor apiserver readiness



    • Epic
    • Resolution: Unresolved
    • Major
    • Observability: measure LB readiness
    • To Do
    • OCPSTRAT-175 - Observability Improvements of API server (part 1)



      • CI only (extending this to customer clusters can be discussed in the future)
      • external LB, internal LB, and service network LB


      Acceptance Criteria:

      • we should exercise both HTTP/1.x and HTTP/2.0 (the current disruption test has never exercised HTTP/2.0)
      • we should exercise requests over existing TCP connections and over new TCP connections
      • measure how long each LB takes to react to the readiness change each time an apiserver goes down
      • TRT dashboard to show data over multiple platforms
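
      The first two criteria amount to a four-cell matrix (protocol by connection reuse). A minimal Go sketch of how a probe could build one `http.Transport` per cell is below. The `probeMode` type and the mode names are hypothetical and are not taken from the existing disruption test; the sketch assumes HTTPS endpoints, where an empty non-nil `TLSNextProto` map keeps the transport on HTTP/1.x.

```go
package main

import (
	"crypto/tls"
	"fmt"
	"net/http"
)

// probeMode is a hypothetical description of one cell in the
// protocol x connection-reuse matrix.
type probeMode struct {
	Name      string
	UseHTTP2  bool
	ReuseConn bool
}

// probeModes lists the four combinations named in the acceptance criteria.
func probeModes() []probeMode {
	return []probeMode{
		{Name: "http1-new-connections", UseHTTP2: false, ReuseConn: false},
		{Name: "http1-reused-connections", UseHTTP2: false, ReuseConn: true},
		{Name: "http2-new-connections", UseHTTP2: true, ReuseConn: false},
		{Name: "http2-reused-connections", UseHTTP2: true, ReuseConn: true},
	}
}

// newProbeTransport returns a transport configured for the given mode.
func newProbeTransport(m probeMode) *http.Transport {
	tr := &http.Transport{
		// With keep-alives disabled, each request dials a new TCP connection.
		DisableKeepAlives: !m.ReuseConn,
		// Ask the transport to negotiate HTTP/2 over TLS when possible.
		ForceAttemptHTTP2: m.UseHTTP2,
	}
	if !m.UseHTTP2 {
		// A non-nil, empty TLSNextProto map disables HTTP/2 on this transport.
		tr.TLSNextProto = map[string]func(string, *tls.Conn) http.RoundTripper{}
	}
	return tr
}

func main() {
	for _, m := range probeModes() {
		tr := newProbeTransport(m)
		fmt.Printf("%s: keepAlivesDisabled=%v attemptHTTP2=%v\n",
			m.Name, tr.DisableKeepAlives, tr.ForceAttemptHTTP2)
	}
}
```

      A probe would wrap each transport in its own `http.Client` and could record `resp.Proto` on every response to confirm which protocol was actually negotiated.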


      Detect in CI how long it takes the external, internal, and service LBs to honor readiness going to false for the kube-apiserver.


      • the apiserver can write shutdown-related data to the response header for requests that ask for it (perhaps via a conditional request header). The disruption test can then check the response header and determine whether readiness is being respected. (This does not require any post-job audit log parsing.)
      • add a metric that starts recording as soon as the apiserver starts the shutdown process
      • build an alert based on the metric, if appropriate
      • add an annotation to audit entries for requests that arrive after the apiserver initiates the shutdown process; the annotation should record the time elapsed between the start of shutdown and the arrival of the request


      In CI, this is our goal:

      • add audit log streaming to openshift/origin
      • add audit log summarization for the streaming to gather the "got even after server was ready"
      • produce data indicating how long each LB took to react to the readiness change each time an apiserver went down
      • produce a synthetic test that fails when that duration changes significantly for the worse, with values per platform
      • update the operator to tighten the window to a reasonable multiple of the p99







              Unassigned
              Abu H Kashem (akashem@redhat.com)
              Deepak Punia