Sub-task
Resolution: Done
Idea: calculate percentiles as part of gathering artifacts from a cluster
Options:
- create a new step in the CI operator step registry to process the audit-logs.tar.gz archive produced by the gather-audit-logs step
- process the audit logs right after the logs are pulled by must-gather: https://github.com/openshift/release/blob/f113ad4a7bd6c6b5597901b2be6d38186982a0da/ci-operator/step-registry/gather/audit-logs/gather-audit-logs-commands.sh#L31
- extend https://github.com/openshift/must-gather/blob/b0f5083ca043c77bcc1b285d43afcd6a30386799/collection-scripts/gather_audit_logs to process raw audit logs before they are archived
- create a new step which invokes oc adm node-logs for the openshift-apiserver and kube-apiserver paths independently of must-gather (see the sketch after this list)
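A minimal sketch of what the option 4 step could do, assuming the kube-apiserver/ and openshift-apiserver/ audit log paths served by oc adm node-logs; the helper names and the restriction to the current audit.log (rotated files ignored) are illustrative assumptions:

    # Pull raw audit log lines from every master node without relying on must-gather.
    import subprocess

    def master_nodes():
        out = subprocess.run(
            ["oc", "get", "nodes", "-l", "node-role.kubernetes.io/master",
             "-o", "jsonpath={.items[*].metadata.name}"],
            capture_output=True, text=True, check=True,
        ).stdout
        return out.split()

    def stream_audit_lines(component):
        """Yield raw audit log lines for kube-apiserver or openshift-apiserver."""
        for node in master_nodes():
            proc = subprocess.Popen(
                ["oc", "adm", "node-logs", node, f"--path={component}/audit.log"],
                stdout=subprocess.PIPE, text=True,
            )
            for line in proc.stdout:
                yield line
            proc.wait()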
Option 1 has the advantage of creating a separate step that can be maintained independently of other steps. On the other hand, the step needs to wait until the gather-audit-logs step has finished. Also, the audit-logs.tar.gz archive and all individual kube-apiserver and openshift-apiserver archives need to be extracted.
Option 2 saves extracting audit-logs.tar.gz and requires no new step. On the other hand, all individual kube-apiserver and openshift-apiserver archives still need to be extracted.
Option 3 can work directly with all the individual kube-apiserver and openshift-apiserver audit logs. There are two sub-options:
- intercept the command which pulls the audit logs in https://github.com/openshift/must-gather/blob/b0f5083ca043c77bcc1b285d43afcd6a30386799/collection-scripts/gather_audit_logs#L47 so that it also pipes the raw audit log lines into a new binary/script for further processing. This sub-option does not require pulling the kube-apiserver and openshift-apiserver audit logs twice
- run oc adm node-logs once more, only for the kube-apiserver and openshift-apiserver audit logs, and process them (the audit logs are pulled twice)
Option 4 has the advantage of creating a separate step that can be maintained independently of other steps. Also, the step can be invoked at any point, and the must-gather collection scripts do not need to change. The disadvantage of this option is that the kube-apiserver and openshift-apiserver audit logs are pulled twice.
The advantage of option 4 over option 1 is the reduced need to store all audit logs on disk. Additionally, only a fraction of the audit logs is processed further (verb=watch, username ending with "operator", stage=ResponseComplete, etc.). Testing of the overall solution is also simplified, as only a running cluster is required. A rough estimate of the additionally pulled kube/openshift-apiserver logs is around 2 GB.
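A sketch of the filtering mentioned above, assuming the standard Kubernetes audit event fields (verb, stage, user.username); the exact filter set is an example, not a final list:

    # Keep only completed watch requests issued by *operator users.
    import json

    def interesting_events(lines):
        for line in lines:
            try:
                ev = json.loads(line)
            except json.JSONDecodeError:
                continue  # skip truncated or garbled lines
            if (ev.get("verb") == "watch"
                    and ev.get("stage") == "ResponseComplete"
                    and ev.get("user", {}).get("username", "").endswith("operator")):
                yield ev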
The overall workflow:
- for each relevant CI job, calculate the maximal number of watch requests per operator across all possible 60min long buckets (produced by the gather-audit-log-stats step-registry step); a sketch of this computation follows the list
- upload the produced per-operator stats into a BigQuery database
- have the TRT dashboard calculate percentiles of its choosing
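A sketch of the first workflow step, assuming the filtered events from the previous sketch and interpreting "all possible 60min long buckets" as a window ending at every event timestamp:

    # Maximal number of watch requests per operator in any 60-minute window.
    from collections import defaultdict
    from datetime import datetime, timedelta

    WINDOW = timedelta(minutes=60)

    def parse_ts(ev):
        # audit events carry e.g. "requestReceivedTimestamp": "2021-03-04T17:47:52.480712Z"
        return datetime.strptime(ev["requestReceivedTimestamp"], "%Y-%m-%dT%H:%M:%S.%fZ")

    def max_watch_requests_per_operator(events):
        per_user = defaultdict(list)
        for ev in events:
            per_user[ev["user"]["username"]].append(parse_ts(ev))
        result = {}
        for user, times in per_user.items():
            times.sort()
            best, start = 0, 0
            for end, t in enumerate(times):   # two-pointer sliding window
                while t - times[start] > WINDOW:
                    start += 1
                best = max(best, end - start + 1)
            result[user] = best
        return result

The per-operator maxima from this step are what would be uploaded to BigQuery, leaving percentile computation to the dashboard.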