OpenShift Bugs / OCPBUGS-24061

CSO generates excessive progressing condition events


    • Type: Bug
    • Resolution: Done-Errata
    • Priority: Normal
    • Fix Version: 4.16.0
    • Affects Version: 4.15.0
    • Component: Storage / Operators
    • Severity: Moderate
    • Release Note Type: Release Note Not Required
    • In Progress

      We had this CI job failing because clusteroperator/storage kept flip-flopping between Progressing=True and Progressing=False:

      https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_api/1684/pull-ci-openshift-api-master-e2e-aws-serial-techpreview/1729464659330732032

      [sig-arch] events should not repeat pathologically for ns/openshift-cluster-storage-operator    0s
      {  1 events happened too frequently
      event happened 21 times, something is wrong: namespace/openshift-cluster-storage-operator deployment/cluster-storage-operator hmsg/cfc7e5cdbe - reason/OperatorStatusChanged Status for clusteroperator/storage changed: Progressing changed from True to False ("AWSEBSCSIDriverOperatorCRProgressing: All is well\nSHARESCSIDriverOperatorCRProgressing: All is well") From: 14:13:20Z To: 14:13:21Z result=reject }
      

      This exposed OCPBUGS-24027, which is now fixed.

      However, this job still produced an excessive number of Progressing events:

      $ grep 'clusteroperator/storage changed: Progressing' events.txt > progressing.txt
      $ wc -l progressing.txt 
      28 progressing.txt
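
      To see which transitions account for those 28 lines, one option is to strip the leading namespace/age columns and count duplicates (a rough sketch against the same events.txt; the sed pattern assumes the column layout shown in the output below):

      $ grep 'clusteroperator/storage changed: Progressing' events.txt \
          | sed 's/.*Status for clusteroperator/Status for clusteroperator/' \
          | sort | uniq -c | sort -rn | head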

      A small subset of those 28 events actually change the condition between True and False:

      $ grep 'clusteroperator/storage changed: Progressing' events.txt | grep True
      openshift-cluster-storage-operator                 143m        Normal    OperatorStatusChanged                        deployment/cluster-storage-operator                                            Status for clusteroperator/storage changed: Progressing changed from Unknown to False ("All is well"),Available changed from Unknown to True ("DefaultStorageClassControllerAvailable: StorageClass provided by supplied CSI Driver instead of the cluster-storage-operator")
      openshift-cluster-storage-operator                 143m        Normal    OperatorStatusChanged                        deployment/cluster-storage-operator                                            Status for clusteroperator/storage changed: Progressing changed from False to True ("AWSEBSProgressing: Waiting for Deployment to act on changes")
      openshift-cluster-storage-operator                 143m        Normal    OperatorStatusChanged                        deployment/cluster-storage-operator                                            Status for clusteroperator/storage changed: Progressing message changed from "AWSEBSProgressing: Waiting for Deployment to deploy pods" to "AWSEBSCSIDriverOperatorCRProgressing: Waiting for AWSEBS operator to report status\nAWSEBSProgressing: Waiting for Deployment to deploy pods",Available changed from True to False ("AWSEBSCSIDriverOperatorCRAvailable: Waiting for AWSEBS operator to report status"),Upgradeable changed from Unknown to True ("All is well")
      openshift-cluster-storage-operator                 136m        Normal    OperatorStatusChanged                        deployment/cluster-storage-operator                                            Status for clusteroperator/storage changed: Progressing changed from True to False ("AWSEBSCSIDriverOperatorCRProgressing: All is well\nSHARESCSIDriverOperatorCRProgressing: All is well")
      openshift-cluster-storage-operator                 45m         Normal    OperatorStatusChanged                        deployment/cluster-storage-operator                                            Status for clusteroperator/storage changed: Progressing changed from False to True ("AWSEBSCSIDriverOperatorCRProgressing: AWSEBSDriverNodeServiceControllerProgressing: Waiting for DaemonSet to deploy node pods")
      openshift-cluster-storage-operator                 2m11s       Normal    OperatorStatusChanged                        deployment/cluster-storage-operator                                            Status for clusteroperator/storage changed: Progressing changed from True to False ("AWSEBSCSIDriverOperatorCRProgressing: All is well\nSHARESCSIDriverOperatorCRProgressing: All is well")
      openshift-cluster-storage-operator                 8m6s        Normal    OperatorStatusChanged                        deployment/cluster-storage-operator                                            Status for clusteroperator/storage changed: Progressing changed from False to True ("SHARESCSIDriverOperatorCRProgressing: SharedResourcesDriverNodeServiceControllerProgressing: Waiting for DaemonSet to deploy node pods")
      openshift-cluster-storage-operator                 2m12s       Normal    OperatorStatusChanged                        deployment/cluster-storage-operator                                            Status for clusteroperator/storage changed: Progressing changed from False to True ("SHARESProgressing: Waiting for Deployment to deploy pods")
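
      The remaining lines are mostly message-only updates to the Progressing condition. Isolating them is just the complement of the grep above (a rough sketch; the filter is approximate and the count varies by run):

      $ grep 'clusteroperator/storage changed: Progressing' events.txt | grep -v True > message-only.txt
      $ wc -l message-only.txt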

      But then we end up with events like the following, where CSO has merely appended more noise from competing controllers to the status message:

      openshift-cluster-storage-operator                 142m        Normal    OperatorStatusChanged                        deployment/cluster-storage-operator                                            Status for clusteroperator/storage changed: Progressing message changed from "AWSEBSCSIDriverOperatorCRProgressing: AWSEBSDriverControllerServiceControllerProgressing: Waiting for Deployment to act on changes\nAWSEBSCSIDriverOperatorCRProgressing: AWSEBSDriverNodeServiceControllerProgressing: Waiting for DaemonSet to deploy node pods\nSHARESCSIDriverOperatorCRProgressing: SharedResourceCSIDriverWebhookControllerProgressing: Waiting for Deployment to deploy pods" to "AWSEBSCSIDriverOperatorCRProgressing: AWSEBSDriverControllerServiceControllerProgressing: Waiting for Deployment to deploy pods\nAWSEBSCSIDriverOperatorCRProgressing: AWSEBSDriverNodeServiceControllerProgressing: Waiting for DaemonSet to deploy node pods\nSHARESCSIDriverOperatorCRProgressing: SharedResourceCSIDriverWebhookControllerProgressing: Waiting for Deployment to deploy pods",Available message changed from "AWSEBSCSIDriverOperatorCRAvailable: AWSEBSDriverControllerServiceControllerAvailable: Waiting for Deployment\nAWSEBSCSIDriverOperatorCRAvailable: AWSEBSDriverNodeServiceControllerAvailable: Waiting for the DaemonSet to deploy the CSI Node Service\nSHARESCSIDriverOperatorCRAvailable: SharedResourceCSIDriverWebhookControllerAvailable: Waiting for Deployment\nSHARESCSIDriverOperatorCRAvailable: SharedResourcesDriverNodeServiceControllerAvailable: Waiting for the DaemonSet to deploy the CSI Node Service" to "AWSEBSCSIDriverOperatorCRAvailable: AWSEBSDriverControllerServiceControllerAvailable: Waiting for Deployment\nSHARESCSIDriverOperatorCRAvailable: SharedResourceCSIDriverWebhookControllerAvailable: Waiting for Deployment\nSHARESCSIDriverOperatorCRAvailable: SharedResourcesDriverNodeServiceControllerAvailable: Waiting for the DaemonSet to deploy the CSI Node Service" 

      Multiple controllers across multiple operators update the Progressing condition, which generates an excessive number of events. This would be (at least) annoying on a live cluster, but it also leaves CSO susceptible to `events should not repeat pathologically` test flakes in CI.
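
      On a live cluster, the same churn can be observed by watching the operator's events directly (a sketch using the namespace and event reason from the CI output above):

      $ oc -n openshift-cluster-storage-operator get events \
          --field-selector reason=OperatorStatusChanged --watch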

              Assignee: Jonathan Dobson (jdobson@redhat.com)
              Reporter: Jonathan Dobson (jdobson@redhat.com)
              QA Contact: Wei Duan