OpenShift API Server / API-1526

Observability: external, internal, and service LB should honor apiserver readiness


    • Type: Epic
    • Resolution: Unresolved
    • Priority: Major
    • Epic name: Observability: measure LB readiness
    • Status: To Do
    • OCPSTRAT-175 - Observability Improvements of API server (part 1)
    • 100% To Do, 0% In Progress, 0% Done

      Scope:

      • CI only (In the future we can discuss how this can be extended to customer cluster)
      • external LB, internal LB, and service network LB

       

      Acceptance Criteria:

      • we should exercise both HTTP/1.x and HTTP/2 (the current disruption test has never exercised HTTP/2)
      • we should exercise requests over existing TCP connections and over new TCP connections
      • measure how long each LB took to react to the readiness change each time an apiserver goes down
      • the TRT dashboard should show the data across multiple platforms

       

      Detect in CI how long it takes the external, internal, and service LBs to honor readiness going to false for the kube-apiserver.

      (ideas)

      • the apiserver can write shutdown-related data to a response header for requests that ask for it (for example, via a conditional request header). The disruption test can then check the response header and determine whether readiness is being respected. This does not require any post-job audit log parsing.
      • add a metric that starts recording as soon as the apiserver starts the shutdown process
      • build an alert based on the metric, if appropriate
      • add an annotation to audit entries for requests that arrive immediately after the apiserver initiates the shutdown process; the annotation should record the time elapsed between the start of shutdown and the arrival of the request

       

      In CI, this is our goal:

      • add audit log streaming to openshift/origin
      • add audit log summarization on top of the streaming to gather the requests that were received even after the server reported not ready
      • produce data indicating how long each LB took to react to the readiness change each time an apiserver goes down
      • produce a synthetic test that fails when that duration changes significantly for the worse, with values per platform
      • update the operator to tighten the window to a reasonable multiple of the p99

       
       

      Reference:

            Assignee: Unassigned
            Reporter: Abu H Kashem (akashem@redhat.com)
            Deepak Punia