OpenShift Bugs / OCPBUGS-24061

CSO generates excessive progressing condition events


    • Type: Bug
    • Resolution: Done-Errata
    • Priority: Normal
    • Fix Version: 4.16.0
    • Affects Version: 4.15.0
    • Component: Storage / Operators
    • Severity: Moderate
    • Release Note Type: Release Note Not Required
    • In Progress

      We had this CI job failing because clusteroperator/storage kept flip-flopping between Progressing=True and Progressing=False:

      https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_api/1684/pull-ci-openshift-api-master-e2e-aws-serial-techpreview/1729464659330732032

      [sig-arch] events should not repeat pathologically for ns/openshift-cluster-storage-operator    0s
      {  1 events happened too frequently
      event happened 21 times, something is wrong: namespace/openshift-cluster-storage-operator deployment/cluster-storage-operator hmsg/cfc7e5cdbe - reason/OperatorStatusChanged Status for clusteroperator/storage changed: Progressing changed from True to False ("AWSEBSCSIDriverOperatorCRProgressing: All is well\nSHARESCSIDriverOperatorCRProgressing: All is well") From: 14:13:20Z To: 14:13:21Z result=reject }
      

      This exposed OCPBUGS-24027, which is now fixed.

      However, this job still produced an excessive number of Progressing events:

      $ grep 'clusteroperator/storage changed: Progressing' events.txt > progressing.txt
      $ wc -l progressing.txt 
      28 progressing.txt
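
      To see which transitions account for those 28 lines, one option is to strip the leading namespace/age columns and count duplicates (a rough sketch against the same events.txt; the sed pattern assumes the column layout shown in the output below):

      $ grep 'clusteroperator/storage changed: Progressing' events.txt \
          | sed 's/.*Status for clusteroperator/Status for clusteroperator/' \
          | sort | uniq -c | sort -rn | head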

      A small subset of those 28 events actually change the condition between True and False:

      $ grep 'clusteroperator/storage changed: Progressing' events.txt | grep True
      openshift-cluster-storage-operator                 143m        Normal    OperatorStatusChanged                        deployment/cluster-storage-operator                                            Status for clusteroperator/storage changed: Progressing changed from Unknown to False ("All is well"),Available changed from Unknown to True ("DefaultStorageClassControllerAvailable: StorageClass provided by supplied CSI Driver instead of the cluster-storage-operator")
      openshift-cluster-storage-operator                 143m        Normal    OperatorStatusChanged                        deployment/cluster-storage-operator                                            Status for clusteroperator/storage changed: Progressing changed from False to True ("AWSEBSProgressing: Waiting for Deployment to act on changes")
      openshift-cluster-storage-operator                 143m        Normal    OperatorStatusChanged                        deployment/cluster-storage-operator                                            Status for clusteroperator/storage changed: Progressing message changed from "AWSEBSProgressing: Waiting for Deployment to deploy pods" to "AWSEBSCSIDriverOperatorCRProgressing: Waiting for AWSEBS operator to report status\nAWSEBSProgressing: Waiting for Deployment to deploy pods",Available changed from True to False ("AWSEBSCSIDriverOperatorCRAvailable: Waiting for AWSEBS operator to report status"),Upgradeable changed from Unknown to True ("All is well")
      openshift-cluster-storage-operator                 136m        Normal    OperatorStatusChanged                        deployment/cluster-storage-operator                                            Status for clusteroperator/storage changed: Progressing changed from True to False ("AWSEBSCSIDriverOperatorCRProgressing: All is well\nSHARESCSIDriverOperatorCRProgressing: All is well")
      openshift-cluster-storage-operator                 45m         Normal    OperatorStatusChanged                        deployment/cluster-storage-operator                                            Status for clusteroperator/storage changed: Progressing changed from False to True ("AWSEBSCSIDriverOperatorCRProgressing: AWSEBSDriverNodeServiceControllerProgressing: Waiting for DaemonSet to deploy node pods")
      openshift-cluster-storage-operator                 2m11s       Normal    OperatorStatusChanged                        deployment/cluster-storage-operator                                            Status for clusteroperator/storage changed: Progressing changed from True to False ("AWSEBSCSIDriverOperatorCRProgressing: All is well\nSHARESCSIDriverOperatorCRProgressing: All is well")
      openshift-cluster-storage-operator                 8m6s        Normal    OperatorStatusChanged                        deployment/cluster-storage-operator                                            Status for clusteroperator/storage changed: Progressing changed from False to True ("SHARESCSIDriverOperatorCRProgressing: SharedResourcesDriverNodeServiceControllerProgressing: Waiting for DaemonSet to deploy node pods")
      openshift-cluster-storage-operator                 2m12s       Normal    OperatorStatusChanged                        deployment/cluster-storage-operator                                            Status for clusteroperator/storage changed: Progressing changed from False to True ("SHARESProgressing: Waiting for Deployment to deploy pods")
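
      The remaining lines are mostly message-only updates to the Progressing condition. Isolating them is just the complement of the grep above (a rough sketch; the filter is approximate and the count varies by run):

      $ grep 'clusteroperator/storage changed: Progressing' events.txt | grep -v True > message-only.txt
      $ wc -l message-only.txt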

      But then we end up with events like the following, where CSO has merely appended more noise from competing controllers to the status message:

      openshift-cluster-storage-operator                 142m        Normal    OperatorStatusChanged                        deployment/cluster-storage-operator                                            Status for clusteroperator/storage changed: Progressing message changed from "AWSEBSCSIDriverOperatorCRProgressing: AWSEBSDriverControllerServiceControllerProgressing: Waiting for Deployment to act on changes\nAWSEBSCSIDriverOperatorCRProgressing: AWSEBSDriverNodeServiceControllerProgressing: Waiting for DaemonSet to deploy node pods\nSHARESCSIDriverOperatorCRProgressing: SharedResourceCSIDriverWebhookControllerProgressing: Waiting for Deployment to deploy pods" to "AWSEBSCSIDriverOperatorCRProgressing: AWSEBSDriverControllerServiceControllerProgressing: Waiting for Deployment to deploy pods\nAWSEBSCSIDriverOperatorCRProgressing: AWSEBSDriverNodeServiceControllerProgressing: Waiting for DaemonSet to deploy node pods\nSHARESCSIDriverOperatorCRProgressing: SharedResourceCSIDriverWebhookControllerProgressing: Waiting for Deployment to deploy pods",Available message changed from "AWSEBSCSIDriverOperatorCRAvailable: AWSEBSDriverControllerServiceControllerAvailable: Waiting for Deployment\nAWSEBSCSIDriverOperatorCRAvailable: AWSEBSDriverNodeServiceControllerAvailable: Waiting for the DaemonSet to deploy the CSI Node Service\nSHARESCSIDriverOperatorCRAvailable: SharedResourceCSIDriverWebhookControllerAvailable: Waiting for Deployment\nSHARESCSIDriverOperatorCRAvailable: SharedResourcesDriverNodeServiceControllerAvailable: Waiting for the DaemonSet to deploy the CSI Node Service" to "AWSEBSCSIDriverOperatorCRAvailable: AWSEBSDriverControllerServiceControllerAvailable: Waiting for Deployment\nSHARESCSIDriverOperatorCRAvailable: SharedResourceCSIDriverWebhookControllerAvailable: Waiting for Deployment\nSHARESCSIDriverOperatorCRAvailable: SharedResourcesDriverNodeServiceControllerAvailable: Waiting for the DaemonSet to deploy the CSI Node Service" 

      Multiple controllers across multiple operators update the Progressing condition, which generates an excessive number of events. This would be (at least) annoying on a live cluster, but it also leaves CSO susceptible to `events should not repeat pathologically` test flakes in CI.
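
      On a live cluster, the same churn can be observed by watching the operator's events directly (a sketch using the namespace and event reason from the CI output above):

      $ oc -n openshift-cluster-storage-operator get events \
          --field-selector reason=OperatorStatusChanged --watch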

              Assignee: Jonathan Dobson (jdobson@redhat.com)
              Reporter: Jonathan Dobson (jdobson@redhat.com)
              QA Contact: Wei Duan