Description of problem
Seen in CI runs like nightly-4.18-e2e-vsphere-ovn-serial/1881600762434293760:
: [sig-arch][Early] Managed cluster should [apigroup:config.openshift.io] start all core operators [Skipped:Disconnected] [Suite:openshift/conformance/parallel]
{ fail [github.com/openshift/origin/test/extended/operators/operators.go:139]: Some cluster operators are not ready: olm (Available=False CatalogdDeploymentCatalogdControllerManager_Deploying: CatalogdDeploymentCatalogdControllerManagerAvailable: Waiting for Deployment)
Ginkgo exit error 1: exit with code 1}
This seems to have been a brief hiccup in catalogd's leader lease:
$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/logs/periodic-ci-openshift-release-master-nightly-4.18-e2e-vsphere-ovn-serial/1881600762434293760/artifacts/e2e-vsphere-ovn-serial/gather-extra/artifacts/events.json | jq -r '.items[] | select(.metadata.namespace == "openshift-catalogd") | .time = (.firstTimestamp // .eventTime) | .time + " " + (.count | tostring) + " " + (.involvedObject | .kind + " " + .namespace + " " + .name) + " " + .reason + ": " + .message' | sort
2025-01-21T07:26:31Z 1 Pod openshift-catalogd catalogd-controller-manager-56d54f9c48-qnt7b Scheduled: Successfully assigned openshift-catalogd/catalogd-controller-manager-56d54f9c48-qnt7b to ci-op-f2vfirhs-40bbe-l4mch-master-1
...
2025-01-21T07:37:37Z 1 Lease openshift-catalogd catalogd-operator-lock LeaderElection: catalogd-controller-manager-56d54f9c48-qnt7b_9a7ccb95-82b3-4293-9fde-d6ce5092cfec became leader
2025-01-21T08:06:28Z 1 Lease openshift-catalogd catalogd-operator-lock LeaderElection: catalogd-controller-manager-56d54f9c48-qnt7b_319ca893-a268-4ea6-8cc8-982735bd03db became leader
$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/logs/periodic-ci-openshift-release-master-nightly-4.18-e2e-vsphere-ovn-serial/1881600762434293760/artifacts/e2e-vsphere-ovn-serial/gather-extra/artifacts/pods/openshift-catalogd_catalogd-controller-manager-56d54f9c48-qnt7b_manager_previous.log | tail -n6
I0121 08:01:37.599302 1 clustercatalog_controller.go:86] "reconcile starting" logger="catalogd-controller" controller="clustercatalog" controllerGroup="olm.operatorframework.io" controllerKind="ClusterCatalog" ClusterCatalog="openshift-redhat-operators" namespace="" name="openshift-redhat-operators" reconcileID="25fcd127-e192-4f08-b11f-e598e375493c"
I0121 08:01:37.599418 1 clustercatalog_controller.go:134] "reconcile ending" logger="catalogd-controller" controller="clustercatalog" controllerGroup="olm.operatorframework.io" controllerKind="ClusterCatalog" ClusterCatalog="openshift-redhat-operators" namespace="" name="openshift-redhat-operators" reconcileID="25fcd127-e192-4f08-b11f-e598e375493c"
E0121 08:06:02.751106 1 leaderelection.go:429] Failed to update lock optimitically: Put "https://172.30.0.1:443/apis/coordination.k8s.io/v1/namespaces/openshift-catalogd/leases/catalogd-operator-lock": context deadline exceeded, falling back to slow path
E0121 08:06:02.751185 1 leaderelection.go:436] error retrieving resource lock openshift-catalogd/catalogd-operator-lock: client rate limiter Wait returned an error: context deadline exceeded
I0121 08:06:02.751200 1 leaderelection.go:297] failed to renew lease openshift-catalogd/catalogd-operator-lock: timed out waiting for the condition
E0121 08:06:02.751276 1 main.go:344] "problem running manager" err="leader election lost" logger="setup"
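On a live cluster, the current holder and renew time of that lease can be checked directly; this is just an inspection aid (assuming oc access to the cluster), not something taken from the CI artifacts above:

$ oc -n openshift-catalogd get lease catalogd-operator-lock -o jsonpath='{.spec.holderIdentity}{"\t"}{.spec.renewTime}{"\n"}'

In the events above, the holder identity keeps the same pod-name prefix but changes its UID suffix, and the failure log comes from manager_previous.log, so it looks like the container lost the lease, exited, and re-acquired it after restarting, rather than the pod being replaced.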
This bug is requesting that the ClusterOperator stay Available=True during this kind of brief hiccup, to conform with docs:
Available=False means at least part of the component is non-functional, and that the condition requires immediate administrator intervention.
and this run recovered quickly without requiring any admin intervention.
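Whether the operator has already recovered, and how long any Available=False window lasted, can be read from the ClusterOperator conditions. A minimal example, assuming a live cluster and jq (not taken from the CI artifacts above):

$ oc get clusteroperator olm -o json | jq -r '.status.conditions[] | select(.type == "Available") | .lastTransitionTime + " Available=" + .status + " " + (.reason // "") + ": " + (.message // "")'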
Version-Release number of selected component
CI Search turns up hits going back through 4.14, but this is probably only worth fixing in 4.18 and later, where the component is GA (is this OLMv1?):
$ curl -s 'https://search.dptools.openshift.org/search?search=Available.False.*Waiting+for+Deployment&maxAge=48h&type=junit&context=0' | jq -r 'keys[]' | grep -o '[-]4[.]1[0-9]' | sort | uniq -c | sort -n
      1 -4.14
      2 -4.16
      2 -4.17
      3 -4.15
     19 -4.18
     24 -4.19
How reproducible
Roughly 1% to 36% impact, depending on the job flavor, based on:
$ w3m -dump -cols 200 'https://search.dptools.openshift.org/?search=Available.False.*Waiting+for+Deployment&maxAge=48h&type=junit&context=0' | grep '[0-9][0-9] runs.*failures match'
periodic-ci-openshift-release-master-ci-4.19-e2e-azure-ovn-upgrade (all) - 71 runs, 27% failed, 11% of failures match = 3% impact
pull-ci-openshift-installer-main-e2e-vsphere-static-ovn (all) - 21 runs, 14% failed, 100% of failures match = 14% impact
openshift-machine-config-operator-4735-ci-4.19-e2e-azure-ovn-upgrade (all) - 10 runs, 30% failed, 33% of failures match = 10% impact
pull-ci-openshift-installer-main-e2e-vsphere-ovn-multi-network (all) - 21 runs, 24% failed, 20% of failures match = 5% impact
periodic-ci-openshift-release-master-okd-scos-4.18-e2e-vsphere-ovn (all) - 11 runs, 82% failed, 44% of failures match = 36% impact
pull-ci-openshift-installer-main-e2e-azure-ovn-resourcegroup (all) - 19 runs, 37% failed, 14% of failures match = 5% impact
periodic-ci-openshift-release-master-ci-4.19-e2e-aws-upgrade-ovn-single-node (all) - 71 runs, 54% failed, 3% of failures match = 1% impact
periodic-ci-openshift-release-master-nightly-4.18-e2e-vsphere-ovn (all) - 12 runs, 58% failed, 43% of failures match = 25% impact
periodic-ci-openshift-release-master-ci-4.18-e2e-azure-ovn-upgrade (all) - 60 runs, 40% failed, 4% of failures match = 2% impact
periodic-ci-openshift-release-master-ci-4.19-upgrade-from-stable-4.18-e2e-azure-ovn-upgrade (all) - 70 runs, 39% failed, 7% of failures match = 3% impact
pull-ci-openshift-origin-master-e2e-aws-ovn-single-node (all) - 32 runs, 50% failed, 6% of failures match = 3% impact
openshift-api-1997-ci-4.19-e2e-azure-ovn-upgrade (all) - 42 runs, 31% failed, 8% of failures match = 2% impact
Steps to Reproduce
1. Run lots of CI.
2. Search the results for Available=False olm.
Actual results
Lots of search hits, see How reproducible.
Expected results
No matches.
Additional info
The end goal is to avoid alarming cluster admins with noise that they don't need to look into, while still alerting admins when something is actually stuck and likely needs their intervention. Giving components at least a few minutes to see whether a concerning issue recovers doesn't add much delay to the admin response in the situations where we don't automatically recover, and avoids alarming folks in the situations where we do. And you can be more nuanced than "give it a few minutes to see if we auto-recover" if you have a better grasp of the underlying issue than I do.
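As a rough sketch of that intent from the client side (not the operator-side fix itself), "give it a few minutes" amounts to only treating Available=False as actionable once it has persisted past a grace period; the 600-second threshold below is an arbitrary assumption, and GNU date is assumed for parsing the timestamp:

$ status=$(oc get clusteroperator olm -o jsonpath='{.status.conditions[?(@.type=="Available")].status}')
$ since=$(oc get clusteroperator olm -o jsonpath='{.status.conditions[?(@.type=="Available")].lastTransitionTime}')
$ if [ "$status" = False ] && [ $(( $(date +%s) - $(date -d "$since" +%s) )) -gt 600 ]; then echo "olm has been Available=False for over 10 minutes; likely worth admin attention"; fi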
is cloned by / is depended on by: OCPBUGS-49622 "OLM should stay Available=True across brief component leader-lease hiccups" (ON_QA)