OpenShift Over the Air / OTA-983

Design changes to the CVO to reduce logging


    • Type: Spike
    • Resolution: Unresolved
    • Priority: Major
    • OTA 255

      The main goal of Epic OTA-923 is to reduce the overall cluster-version-operator (CVO) logging. There are many opportunities to do so. It would be helpful to come up with more ideas and decide exactly what to change in the CVO. The focus should be primarily on the most frequent log messages.

       

      Some of the ideas:

      • Utilize the logging levels more - once OTA-209 is done, we can properly utilize the logging levels. (The CVO codebase currently uses mainly level 2, with just a few exceptions.)
      • Dynamically change the current logging level depending on the state of the operator/cluster, so that valuable information is not lost when something goes wrong and the current logging level would not show the necessary details.
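A minimal stdlib-only sketch of the dynamic-level idea (in the real CVO, klog's `-v` level would play the role of the `verbosity` variable here; the Degraded trigger and all names are illustrative, not existing CVO code):

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// verbosity stands in for klog's -v flag. Every log call re-reads it,
// so raising it at runtime takes effect on the very next statement.
var verbosity atomic.Int32

// logV prints msg only when the current verbosity is at least level,
// and reports whether the message was emitted.
func logV(level int32, msg string) bool {
	if verbosity.Load() < level {
		return false
	}
	fmt.Println(msg)
	return true
}

func main() {
	verbosity.Store(2) // the usual default level

	logV(4, "per-manifest detail") // suppressed at the default level

	// Hypothetical trigger: the operator notices a Degraded condition
	// and raises verbosity so the failing reconcile logs full detail.
	verbosity.Store(4)
	logV(4, "per-manifest detail") // now emitted
}
```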
      • Do not display the multi-line diff when reconciling manifests at a lower logging level. The diff is very helpful for finding out why a given update happened. However, hotlooping bugs such as OCPBUGS-9070 tend to have a lower priority, the resulting logging is quite large, and the nature of hotlooping causes the diff to be logged indefinitely.
      # grep -A60 'Updating CronJob openshift-operator-lifecycle-manager/collect-profiles due to diff' cvo2.log | tail -n61
      I0110 06:32:44.489277       1 generic.go:109] Updating CronJob openshift-operator-lifecycle-manager/collect-profiles due to diff:   &unstructured.Unstructured{
        	Object: map[string]interface{}{
        		"apiVersion": string("batch/v1"),
        		"kind":       string("CronJob"),
        		"metadata":   map[string]interface{}{"annotations": map[string]interface{}{"include.release.openshift.io/ibm-cloud-managed": string("true"), "include.release.openshift.io/self-managed-high-availability": string("true")}, "creationTimestamp": string("2022-01-10T04:35:19Z"), "generation": int64(1), "managedFields": []interface{}{map[string]interface{}{"apiVersion": string("batch/v1"), "fieldsType": string("FieldsV1"), "fieldsV1": map[string]interface{}{"f:metadata": map[string]interface{}{"f:annotations": map[string]interface{}{".": map[string]interface{}{}, "f:include.release.openshift.io/ibm-cloud-managed": map[string]interface{}{}, "f:include.release.openshift.io/self-managed-high-availability": map[string]interface{}{}}, "f:ownerReferences": map[string]interface{}{".": map[string]interface{}{}, `k:{"uid":"334d6c04-126d-4271-96ec-d303e93b7d1c"}`: map[string]interface{}{}}}, "f:spec": map[string]interface{}{"f:concurrencyPolicy": map[string]interface{}{}, "f:failedJobsHistoryLimit": map[string]interface{}{}, "f:jobTemplate": map[string]interface{}{"f:spec": map[string]interface{}{"f:template": map[string]interface{}{"f:spec": map[string]interface{}{"f:containers": map[string]interface{}{`k:{"name":"collect-profiles"}`: map[string]interface{}{".": map[string]interface{}{}, "f:args": map[string]interface{}{}, "f:command": map[string]interface{}{}, "f:image": map[string]interface{}{}, ...}}, "f:dnsPolicy": map[string]interface{}{}, "f:priorityClassName": map[string]interface{}{}, "f:restartPolicy": map[string]interface{}{}, ...}}}}, "f:schedule": map[string]interface{}{}, ...}}, "manager": string("cluster-version-operator"), ...}, map[string]interface{}{"apiVersion": string("batch/v1"), "fieldsType": string("FieldsV1"), "fieldsV1": map[string]interface{}{"f:status": map[string]interface{}{"f:lastScheduleTime": map[string]interface{}{}, "f:lastSuccessfulTime": map[string]interface{}{}}}, "manager": string("kube-controller-manager"), ...}}, ...},
        		"spec": map[string]interface{}{
      + 			"concurrencyPolicy":      string("Allow"),
      + 			"failedJobsHistoryLimit": int64(1),
        			"jobTemplate": map[string]interface{}{
      + 				"metadata": map[string]interface{}{"creationTimestamp": nil},
        				"spec": map[string]interface{}{
        					"template": map[string]interface{}{
      + 						"metadata": map[string]interface{}{"creationTimestamp": nil},
        						"spec": map[string]interface{}{
        							"containers": []interface{}{
        								map[string]interface{}{
        									... // 4 identical entries
        									"name":                     string("collect-profiles"),
        									"resources":                map[string]interface{}{"requests": map[string]interface{}{"cpu": string("10m"), "memory": string("80Mi")}},
      + 									"terminationMessagePath":   string("/dev/termination-log"),
      + 									"terminationMessagePolicy": string("File"),
        									"volumeMounts":             []interface{}{map[string]interface{}{"mountPath": string("/etc/config"), "name": string("config-volume")}, map[string]interface{}{"mountPath": string("/var/run/secrets/serving-cert"), "name": string("secret-volume")}},
        								},
        							},
      + 							"dnsPolicy":                     string("ClusterFirst"),
        							"priorityClassName":             string("openshift-user-critical"),
        							"restartPolicy":                 string("Never"),
      + 							"schedulerName":                 string("default-scheduler"),
      + 							"securityContext":               map[string]interface{}{},
      + 							"serviceAccount":                string("collect-profiles"),
        							"serviceAccountName":            string("collect-profiles"),
      + 							"terminationGracePeriodSeconds": int64(30),
        							"volumes": []interface{}{
        								map[string]interface{}{
        									"configMap": map[string]interface{}{
      + 										"defaultMode": int64(420),
        										"name":        string("collect-profiles-config"),
        									},
        									"name": string("config-volume"),
        								},
        								map[string]interface{}{
        									"name": string("secret-volume"),
        									"secret": map[string]interface{}{
      + 										"defaultMode": int64(420),
        										"secretName":  string("pprof-cert"),
        									},
        								},
        							},
        						},
        					},
        				},
        			},
        			"schedule":                   string("*/15 * * * *"),
      + 			"successfulJobsHistoryLimit": int64(3),
      + 			"suspend":                    bool(false),
        		},
        		"status": map[string]interface{}{"lastScheduleTime": string("2022-01-10T06:30:00Z"), "lastSuccessfulTime": string("2022-01-10T06:30:11Z")},
        	},
        }
      I0110 06:32:44.499764       1 sync_worker.go:771] Done syncing for cronjob "openshift-operator-lifecycle-manager/collect-profiles" (574 of 765)
      I0110 06:32:44.499814       1 sync_worker.go:759] Running sync for deployment "openshift-operator-lifecycle-manager/olm-operator" (575 of 765) 
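One way to sketch this idea: keep a one-line "Updating ... due to diff" message at the default verbosity and emit the multi-line diff only at a higher level, so hotloops still leave a breadcrumb without flooding the log. The function and threshold below are illustrative, not the CVO's actual API:

```go
package main

import "fmt"

// manifestUpdateLine returns the message to log for a manifest update.
// At verbosity >= 4 it includes the full diff; below that it emits a
// single line pointing the reader at the higher level instead.
func manifestUpdateLine(verbosity int, name, diff string) string {
	if verbosity >= 4 {
		return fmt.Sprintf("Updating %s due to diff: %s", name, diff)
	}
	// Hotlooping still logs one short line per update, not an
	// ever-repeating multi-line diff.
	return fmt.Sprintf("Updating %s due to diff (increase verbosity for details)", name)
}

func main() {
	diff := "&unstructured.Unstructured{...}"
	name := "CronJob openshift-operator-lifecycle-manager/collect-profiles"
	fmt.Println(manifestUpdateLine(2, name, diff))
	fmt.Println(manifestUpdateLine(4, name, diff))
}
```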
      • Reduce the logging when reconciling manifests. Most of the log volume comes from reconciling manifests. For example, we could log only when something goes wrong during reconciliation (or utilize the dynamic log levels mentioned above, or log some sort of summary of the successful reconciliations).
      $ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.14-e2e-aws-sdn-serial/1661746644430884864/artifacts/e2e-aws-sdn-serial/gather-extra/artifacts/pods/openshift-cluster-version_cluster-version-operator-5ff7868877-kzkk9_cluster-version-operator.log | grep -o '[^ ]*[.]go:[0-9]*' | sort | uniq -c | sort -n | tail -n5
         1386 sync_worker.go:972
         3612 task_graph.go:477
         9887 sync_worker.go:702
        35206 sync_worker.go:1001
        35246 sync_worker.go:981
      $ git grep -n . origin/master -- pkg/cvo/sync_worker.go | grep '981\|1001'
      origin/master:pkg/cvo/sync_worker.go:981: klog.V(2).Infof("Running sync for %s", task)
      origin/master:pkg/cvo/sync_worker.go:1001: klog.V(2).Infof("Done syncing for %s", task)
      
      • Come up with new ideas.

       

      Definition of Done:

      • Ideas to reduce the logging have been proposed and agreed upon.
      • An idea has been finalized for implementation.

            Assignee: Unassigned
            Reporter: David Hurta (dhurta@redhat.com)