OCP Technical Release Team / TRT-366 Track operator watch requests more accurately with less maintenance / TRT-418

Provide TRT with script and instructions on how to process audit log data and calculate percentiles


    • Type: Sub-task
    • Resolution: Done
    • Priority: Undefined

      Idea: calculate percentiles as part of gathering artifacts from a cluster

      Options:

      1. create a new step in the CI operator step registry to process the audit-logs.tar.gz archive produced by the gather-audit-logs step
      2. process the audit logs right after the logs are pulled by must-gather: https://github.com/openshift/release/blob/f113ad4a7bd6c6b5597901b2be6d38186982a0da/ci-operator/step-registry/gather/audit-logs/gather-audit-logs-commands.sh#L31
      3. extend https://github.com/openshift/must-gather/blob/b0f5083ca043c77bcc1b285d43afcd6a30386799/collection-scripts/gather_audit_logs to process raw audit logs before they are archived
      4. create a new step which invokes oc adm node-logs for the openshift-apiserver and kube-apiserver paths independently of must-gather

      Option 1 has the advantage of creating a separate step that can be maintained independently of other steps. On the other hand, the step needs to wait until the gather-audit-logs step has finished. Also, the audit-logs.tar.gz archive and all the individual kube-apiserver and openshift-apiserver archives need to be extracted.
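
      For illustration, a minimal extraction sketch for option 1; the inner layout of audit-logs.tar.gz (per-node, per-apiserver gzipped audit logs) is an assumption based on what the gather-audit-logs step collects:

      # Assumed layout: audit-logs.tar.gz wraps individually gzipped
      # kube-apiserver and openshift-apiserver audit logs per node.
      mkdir -p audit-logs
      tar -xzf audit-logs.tar.gz -C audit-logs
      # Unpack every nested archive; each resulting *.log file then holds
      # one JSON audit event per line, ready for further processing.
      find audit-logs -name '*.gz' -exec gunzip {} +
      find audit-logs -name '*.log'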

      Option 2 saves the step of extracting audit-logs.tar.gz and does not require a new step. On the other hand, all the individual kube-apiserver and openshift-apiserver archives still need to be extracted.

      Option 3 can work directly with all the individual kube-apiserver and openshift-apiserver audit logs. There are two sub-options:

      1. intercept the command that pulls the audit logs in https://github.com/openshift/must-gather/blob/b0f5083ca043c77bcc1b285d43afcd6a30386799/collection-scripts/gather_audit_logs#L47 so that it also pipes the raw audit log lines into a new binary/script for further processing (see the sketch after this list). This sub-option does not require pulling the kube-apiserver and openshift-apiserver audit logs twice
      2. run oc adm node-logs one more time, only for the kube-apiserver and openshift-apiserver audit logs, and process them there (the audit logs are pulled twice)
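
      A minimal sketch of sub-option 1; process-audit-stats is a hypothetical new helper, NODE stands for whatever node the surrounding loop in gather_audit_logs iterates over, and the exact node-logs/gzip invocation in that script differs, so this only illustrates the tee/pipe pattern:

      # Tee the raw audit lines into the stats filter while they are being
      # archived, so each audit log is pulled from the node only once.
      oc adm node-logs "$NODE" --path=kube-apiserver/audit.log \
        | tee >(./process-audit-stats > "audit-stats-${NODE}-kube-apiserver.json") \
        | gzip > "kube-apiserver-${NODE}-audit.log.gz"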

      Option 4 has the advantage of creating a separate step that can be maintained independently of other steps, and the step can be invoked at any point; there is no need to change the must-gather collection scripts. The disadvantage of this option is that the kube-apiserver and openshift-apiserver audit logs get pulled twice.

      The advantage of option 4 over option 1 is the reduced need to store all audit logs on disk. Additionally, only a fraction of the audit logs is processed further (verb=watch, username ending with "operator", stage=ResponseComplete, etc.). Testing of the overall solution is also simplified, as only a running cluster is required. A rough estimate of the additionally pulled kube-apiserver/openshift-apiserver logs is around 2 GB.
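
      A rough sketch of what the option 4 step could do, assuming cluster-admin credentials in the step's pod and ignoring rotated audit log files; ARTIFACT_DIR and the output file names are illustrative:

      #!/bin/bash
      set -euo pipefail
      # Pull the current audit log of every control-plane node and keep only
      # the events the stats need: watch requests from operator service
      # accounts that reached the ResponseComplete stage.
      for node in $(oc get nodes -l node-role.kubernetes.io/master -o name | cut -d/ -f2); do
        for apiserver in kube-apiserver openshift-apiserver; do
          oc adm node-logs "$node" --path="${apiserver}/audit.log" \
            | jq -c 'select(.verb == "watch" and .stage == "ResponseComplete"
                            and ((.user.username // "") | endswith("operator")))' \
            > "${ARTIFACT_DIR:-.}/${node}-${apiserver}-operator-watch.json"
        done
      done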

      The overall workflow:

      1. for each relevant CI job, calculate the maximal number of watch requests per operator across all possible 60-minute-long buckets (produced by the gather-audit-log-stats step-registry step; a simplified sketch follows this list)
      2. upload the produced per-operator stats into a BigQuery database
      3. have the TRT dashboard calculate percentiles of TRT's choosing
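
      A minimal sketch of step 1 over the filtered events from the option 4 sketch above, simplified to fixed clock-hour buckets instead of every possible 60-minute window (the real gather-audit-log-stats step would slide the window across event timestamps):

      # Count watch requests per operator username and clock hour, then keep
      # the maximum over all hours; prints "username max_requests_in_one_hour".
      jq -r '[.user.username, .requestReceivedTimestamp[0:13]] | @tsv' \
          "${ARTIFACT_DIR:-.}"/*-operator-watch.json \
        | sort | uniq -c \
        | awk '{ if ($1 > max[$2]) max[$2] = $1 } END { for (op in max) print op, max[op] }'

      Once those per-job, per-operator maxima are in BigQuery (step 2), the step 3 percentiles reduce to a single aggregate query on the dashboard side (e.g. BigQuery's APPROX_QUANTILES).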

              jchaloup@redhat.com Jan Chaloupka
              rhn-engineering-dgoodwin Devan Goodwin