OpenShift Over the Air / OTA-983

Design changes to the CVO to reduce logging


    • Type: Spike
    • Resolution: Unresolved
    • Priority: Major
    • OTA 255

      The main goal of Epic OTA-923 is to reduce the overall cluster-version-operator (CVO) logging. There are many opportunities to do so. It would be helpful to come up with more ideas and decide exactly what to change in the CVO. The focus should be primarily on the most frequent log messages.

       

      Some of the ideas:

      • Utilize the logging levels more - once OTA-209 is done, we can properly utilize the logging levels. (The CVO codebase currently uses mainly level 2, with just a few exceptions.)
      • Dynamically change the current logging level depending on the state of the operator/cluster, so that valuable information is not lost when something goes wrong and the current logging level would not show the necessary details.
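A minimal stdlib-only sketch of the dynamic-level idea (in the real CVO, klog's `-v` level would play the role of the `verbosity` variable here; the Degraded trigger and all names are illustrative, not existing CVO code):

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// verbosity stands in for klog's -v flag. Every log call re-reads it,
// so raising it at runtime takes effect on the very next statement.
var verbosity atomic.Int32

// logV prints msg only when the current verbosity is at least level,
// and reports whether the message was emitted.
func logV(level int32, msg string) bool {
	if verbosity.Load() < level {
		return false
	}
	fmt.Println(msg)
	return true
}

func main() {
	verbosity.Store(2) // the usual default level

	logV(4, "per-manifest detail") // suppressed at the default level

	// Hypothetical trigger: the operator notices a Degraded condition
	// and raises verbosity so the failing reconcile logs full detail.
	verbosity.Store(4)
	logV(4, "per-manifest detail") // now emitted
}
```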
      • Do not display the multi-line diff when reconciling manifests at a lower logging level. The diff is very helpful for finding out why a given update happened. However, hotlooping bugs such as OCPBUGS-9070 tend to have a lower priority, the resulting logging is quite large, and the nature of hotlooping causes the diff to be logged indefinitely.
      # grep -A60 'Updating CronJob openshift-operator-lifecycle-manager/collect-profiles due to diff' cvo2.log | tail -n61
      I0110 06:32:44.489277       1 generic.go:109] Updating CronJob openshift-operator-lifecycle-manager/collect-profiles due to diff:   &unstructured.Unstructured{
        	Object: map[string]interface{}{
        		"apiVersion": string("batch/v1"),
        		"kind":       string("CronJob"),
        		"metadata":   map[string]interface{}{"annotations": map[string]interface{}{"include.release.openshift.io/ibm-cloud-managed": string("true"), "include.release.openshift.io/self-managed-high-availability": string("true")}, "creationTimestamp": string("2022-01-10T04:35:19Z"), "generation": int64(1), "managedFields": []interface{}{map[string]interface{}{"apiVersion": string("batch/v1"), "fieldsType": string("FieldsV1"), "fieldsV1": map[string]interface{}{"f:metadata": map[string]interface{}{"f:annotations": map[string]interface{}{".": map[string]interface{}{}, "f:include.release.openshift.io/ibm-cloud-managed": map[string]interface{}{}, "f:include.release.openshift.io/self-managed-high-availability": map[string]interface{}{}}, "f:ownerReferences": map[string]interface{}{".": map[string]interface{}{}, `k:{"uid":"334d6c04-126d-4271-96ec-d303e93b7d1c"}`: map[string]interface{}{}}}, "f:spec": map[string]interface{}{"f:concurrencyPolicy": map[string]interface{}{}, "f:failedJobsHistoryLimit": map[string]interface{}{}, "f:jobTemplate": map[string]interface{}{"f:spec": map[string]interface{}{"f:template": map[string]interface{}{"f:spec": map[string]interface{}{"f:containers": map[string]interface{}{`k:{"name":"collect-profiles"}`: map[string]interface{}{".": map[string]interface{}{}, "f:args": map[string]interface{}{}, "f:command": map[string]interface{}{}, "f:image": map[string]interface{}{}, ...}}, "f:dnsPolicy": map[string]interface{}{}, "f:priorityClassName": map[string]interface{}{}, "f:restartPolicy": map[string]interface{}{}, ...}}}}, "f:schedule": map[string]interface{}{}, ...}}, "manager": string("cluster-version-operator"), ...}, map[string]interface{}{"apiVersion": string("batch/v1"), "fieldsType": string("FieldsV1"), "fieldsV1": map[string]interface{}{"f:status": map[string]interface{}{"f:lastScheduleTime": map[string]interface{}{}, "f:lastSuccessfulTime": map[string]interface{}{}}}, "manager": string("kube-controller-manager"), ...}}, ...},
        		"spec": map[string]interface{}{
      + 			"concurrencyPolicy":      string("Allow"),
      + 			"failedJobsHistoryLimit": int64(1),
        			"jobTemplate": map[string]interface{}{
      + 				"metadata": map[string]interface{}{"creationTimestamp": nil},
        				"spec": map[string]interface{}{
        					"template": map[string]interface{}{
      + 						"metadata": map[string]interface{}{"creationTimestamp": nil},
        						"spec": map[string]interface{}{
        							"containers": []interface{}{
        								map[string]interface{}{
        									... // 4 identical entries
        									"name":                     string("collect-profiles"),
        									"resources":                map[string]interface{}{"requests": map[string]interface{}{"cpu": string("10m"), "memory": string("80Mi")}},
      + 									"terminationMessagePath":   string("/dev/termination-log"),
      + 									"terminationMessagePolicy": string("File"),
        									"volumeMounts":             []interface{}{map[string]interface{}{"mountPath": string("/etc/config"), "name": string("config-volume")}, map[string]interface{}{"mountPath": string("/var/run/secrets/serving-cert"), "name": string("secret-volume")}},
        								},
        							},
      + 							"dnsPolicy":                     string("ClusterFirst"),
        							"priorityClassName":             string("openshift-user-critical"),
        							"restartPolicy":                 string("Never"),
      + 							"schedulerName":                 string("default-scheduler"),
      + 							"securityContext":               map[string]interface{}{},
      + 							"serviceAccount":                string("collect-profiles"),
        							"serviceAccountName":            string("collect-profiles"),
      + 							"terminationGracePeriodSeconds": int64(30),
        							"volumes": []interface{}{
        								map[string]interface{}{
        									"configMap": map[string]interface{}{
      + 										"defaultMode": int64(420),
        										"name":        string("collect-profiles-config"),
        									},
        									"name": string("config-volume"),
        								},
        								map[string]interface{}{
        									"name": string("secret-volume"),
        									"secret": map[string]interface{}{
      + 										"defaultMode": int64(420),
        										"secretName":  string("pprof-cert"),
        									},
        								},
        							},
        						},
        					},
        				},
        			},
        			"schedule":                   string("*/15 * * * *"),
      + 			"successfulJobsHistoryLimit": int64(3),
      + 			"suspend":                    bool(false),
        		},
        		"status": map[string]interface{}{"lastScheduleTime": string("2022-01-10T06:30:00Z"), "lastSuccessfulTime": string("2022-01-10T06:30:11Z")},
        	},
        }
      I0110 06:32:44.499764       1 sync_worker.go:771] Done syncing for cronjob "openshift-operator-lifecycle-manager/collect-profiles" (574 of 765)
      I0110 06:32:44.499814       1 sync_worker.go:759] Running sync for deployment "openshift-operator-lifecycle-manager/olm-operator" (575 of 765) 
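One way to sketch this idea: keep a one-line "Updating ... due to diff" message at the default verbosity and emit the multi-line diff only at a higher level, so hotloops still leave a breadcrumb without flooding the log. The function and threshold below are illustrative, not the CVO's actual API:

```go
package main

import "fmt"

// manifestUpdateLine returns the message to log for a manifest update.
// At verbosity >= 4 it includes the full diff; below that it emits a
// single line pointing the reader at the higher level instead.
func manifestUpdateLine(verbosity int, name, diff string) string {
	if verbosity >= 4 {
		return fmt.Sprintf("Updating %s due to diff: %s", name, diff)
	}
	// Hotlooping still logs one short line per update, not an
	// ever-repeating multi-line diff.
	return fmt.Sprintf("Updating %s due to diff (increase verbosity for details)", name)
}

func main() {
	diff := "&unstructured.Unstructured{...}"
	name := "CronJob openshift-operator-lifecycle-manager/collect-profiles"
	fmt.Println(manifestUpdateLine(2, name, diff))
	fmt.Println(manifestUpdateLine(4, name, diff))
}
```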
      • Reduce the logging when reconciling manifests. Most of the log volume comes from reconciling manifests. For example, we could log only when something goes wrong during reconciliation (or utilize the dynamic log levels mentioned above, or log some sort of summary of the successful reconciliations).
      $ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.14-e2e-aws-sdn-serial/1661746644430884864/artifacts/e2e-aws-sdn-serial/gather-extra/artifacts/pods/openshift-cluster-version_cluster-version-operator-5ff7868877-kzkk9_cluster-version-operator.log | grep -o '[^ ]*[.]go:[0-9]*' | sort | uniq -c | sort -n | tail -n5
         1386 sync_worker.go:972
         3612 task_graph.go:477
         9887 sync_worker.go:702
        35206 sync_worker.go:1001
        35246 sync_worker.go:981
      $ git grep -n . origin/master -- pkg/cvo/sync_worker.go | grep '981\|1001'
      origin/master:pkg/cvo/sync_worker.go:981: klog.V(2).Infof("Running sync for %s", task)
      origin/master:pkg/cvo/sync_worker.go:1001: klog.V(2).Infof("Done syncing for %s", task)
      
      • Come up with new ideas.

       

      Definition of Done:

      • Ideas to reduce the logging have been proposed and agreed upon.
      • An idea has been finalized for implementation.

            Assignee: Unassigned
            Reporter: David Hurta (dhurta@redhat.com)