Description of problem
Seen in CI runs like nightly-4.18-e2e-vsphere-ovn-serial/1881600762434293760:
: [sig-arch][Early] Managed cluster should [apigroup:config.openshift.io] start all core operators [Skipped:Disconnected] [Suite:openshift/conformance/parallel]
{ fail [github.com/openshift/origin/test/extended/operators/operators.go:139]: Some cluster operators are not ready: olm (Available=False CatalogdDeploymentCatalogdControllerManager_Deploying: CatalogdDeploymentCatalogdControllerManagerAvailable: Waiting for Deployment)
Ginkgo exit error 1: exit with code 1}
This seems to have been a brief hiccup in catalogd's leader lease:
$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/logs/periodic-ci-openshift-release-master-nightly-4.18-e2e-vsphere-ovn-serial/1881600762434293760/artifacts/e2e-vsphere-ovn-serial/gather-extra/artifacts/events.json | jq -r '.items[] | select(.metadata.namespace == "openshift-catalogd") | .time = (.firstTimestamp // .eventTime) | .time + " " + (.count | tostring) + " " + (.involvedObject | .kind + " " + .namespace + " " + .name) + " " + .reason + ": " + .message' | sort
2025-01-21T07:26:31Z 1 Pod openshift-catalogd catalogd-controller-manager-56d54f9c48-qnt7b Scheduled: Successfully assigned openshift-catalogd/catalogd-controller-manager-56d54f9c48-qnt7b to ci-op-f2vfirhs-40bbe-l4mch-master-1
...
2025-01-21T07:37:37Z 1 Lease openshift-catalogd catalogd-operator-lock LeaderElection: catalogd-controller-manager-56d54f9c48-qnt7b_9a7ccb95-82b3-4293-9fde-d6ce5092cfec became leader
2025-01-21T08:06:28Z 1 Lease openshift-catalogd catalogd-operator-lock LeaderElection: catalogd-controller-manager-56d54f9c48-qnt7b_319ca893-a268-4ea6-8cc8-982735bd03db became leader
$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/logs/periodic-ci-openshift-release-master-nightly-4.18-e2e-vsphere-ovn-serial/1881600762434293760/artifacts/e2e-vsphere-ovn-serial/gather-extra/artifacts/pods/openshift-catalogd_catalogd-controller-manager-56d54f9c48-qnt7b_manager_previous.log | tail -n6
I0121 08:01:37.599302 1 clustercatalog_controller.go:86] "reconcile starting" logger="catalogd-controller" controller="clustercatalog" controllerGroup="olm.operatorframework.io" controllerKind="ClusterCatalog" ClusterCatalog="openshift-redhat-operators" namespace="" name="openshift-redhat-operators" reconcileID="25fcd127-e192-4f08-b11f-e598e375493c"
I0121 08:01:37.599418 1 clustercatalog_controller.go:134] "reconcile ending" logger="catalogd-controller" controller="clustercatalog" controllerGroup="olm.operatorframework.io" controllerKind="ClusterCatalog" ClusterCatalog="openshift-redhat-operators" namespace="" name="openshift-redhat-operators" reconcileID="25fcd127-e192-4f08-b11f-e598e375493c"
E0121 08:06:02.751106 1 leaderelection.go:429] Failed to update lock optimitically: Put "https://172.30.0.1:443/apis/coordination.k8s.io/v1/namespaces/openshift-catalogd/leases/catalogd-operator-lock": context deadline exceeded, falling back to slow path
E0121 08:06:02.751185 1 leaderelection.go:436] error retrieving resource lock openshift-catalogd/catalogd-operator-lock: client rate limiter Wait returned an error: context deadline exceeded
I0121 08:06:02.751200 1 leaderelection.go:297] failed to renew lease openshift-catalogd/catalogd-operator-lock: timed out waiting for the condition
E0121 08:06:02.751276 1 main.go:344] "problem running manager" err="leader election lost" logger="setup"
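On a live cluster, the current holder and renew time of that lease can be checked directly; this is just an inspection aid (assuming oc access to the cluster), not something taken from the CI artifacts above:

$ oc -n openshift-catalogd get lease catalogd-operator-lock -o jsonpath='{.spec.holderIdentity}{"\t"}{.spec.renewTime}{"\n"}'

In the events above, the holder identity keeps the same pod-name prefix but changes its UID suffix, and the failure log comes from manager_previous.log, so it looks like the container lost the lease, exited, and re-acquired it after restarting, rather than the pod being replaced.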
This bug is requesting that the ClusterOperator stay Available=True during this kind of brief hiccup, to conform with docs:
Available=False means at least part of the component is non-functional, and that the condition requires immediate administrator intervention.
and this run recovered quickly without requiring any admin intervention.
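Whether the operator has already recovered, and how long any Available=False window lasted, can be read from the ClusterOperator conditions. A minimal example, assuming a live cluster and jq (not taken from the CI artifacts above):

$ oc get clusteroperator olm -o json | jq -r '.status.conditions[] | select(.type == "Available") | .lastTransitionTime + " Available=" + .status + " " + (.reason // "") + ": " + (.message // "")'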
Version-Release number of selected component
CI Search turns up hits going back through 4.14, but this is probably only worth fixing in 4.18 and later, where the component is GA (is this OLMv1?):
$ curl -s 'https://search.dptools.openshift.org/search?search=Available.False.*Waiting+for+Deployment&maxAge=48h&type=junit&context=0' | jq -r 'keys[]' | grep -o '[-]4[.]1[0-9]' | sort | uniq -c | sort -n
      1 -4.14
      2 -4.16
      2 -4.17
      3 -4.15
     19 -4.18
     24 -4.19
How reproducible
Roughly 1% to 36% impact, depending on the job flavor, based on:
$ w3m -dump -cols 200 'https://search.dptools.openshift.org/?search=Available.False.*Waiting+for+Deployment&maxAge=48h&type=junit&context=0' | grep '[0-9][0-9] runs.*failures match'
periodic-ci-openshift-release-master-ci-4.19-e2e-azure-ovn-upgrade (all) - 71 runs, 27% failed, 11% of failures match = 3% impact
pull-ci-openshift-installer-main-e2e-vsphere-static-ovn (all) - 21 runs, 14% failed, 100% of failures match = 14% impact
openshift-machine-config-operator-4735-ci-4.19-e2e-azure-ovn-upgrade (all) - 10 runs, 30% failed, 33% of failures match = 10% impact
pull-ci-openshift-installer-main-e2e-vsphere-ovn-multi-network (all) - 21 runs, 24% failed, 20% of failures match = 5% impact
periodic-ci-openshift-release-master-okd-scos-4.18-e2e-vsphere-ovn (all) - 11 runs, 82% failed, 44% of failures match = 36% impact
pull-ci-openshift-installer-main-e2e-azure-ovn-resourcegroup (all) - 19 runs, 37% failed, 14% of failures match = 5% impact
periodic-ci-openshift-release-master-ci-4.19-e2e-aws-upgrade-ovn-single-node (all) - 71 runs, 54% failed, 3% of failures match = 1% impact
periodic-ci-openshift-release-master-nightly-4.18-e2e-vsphere-ovn (all) - 12 runs, 58% failed, 43% of failures match = 25% impact
periodic-ci-openshift-release-master-ci-4.18-e2e-azure-ovn-upgrade (all) - 60 runs, 40% failed, 4% of failures match = 2% impact
periodic-ci-openshift-release-master-ci-4.19-upgrade-from-stable-4.18-e2e-azure-ovn-upgrade (all) - 70 runs, 39% failed, 7% of failures match = 3% impact
pull-ci-openshift-origin-master-e2e-aws-ovn-single-node (all) - 32 runs, 50% failed, 6% of failures match = 3% impact
openshift-api-1997-ci-4.19-e2e-azure-ovn-upgrade (all) - 42 runs, 31% failed, 8% of failures match = 2% impact
Steps to Reproduce
1. Run lots of CI.
2. Search the results for Available=False olm.
Actual results
Lots of search hits, see How reproducible.
Expected results
No matches.
Additional info
The end goal is to avoid alarming cluster admins with noise that they don't need to look into, while still alerting admins when something is actually stuck and likely needs their intervention. Giving components at least a few minutes to see whether a concerning issue recovers doesn't add much delay to the admin response in the situations where we don't automatically recover, and avoids alarming folks in the situations where we do. And you can be more nuanced than "give it a few minutes to see if we auto-recover" if you have a better grasp of the underlying issue than I do.
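As a rough sketch of that intent from the client side (not the operator-side fix itself), "give it a few minutes" amounts to only treating Available=False as actionable once it has persisted past a grace period; the 600-second threshold below is an arbitrary assumption, and GNU date is assumed for parsing the timestamp:

$ status=$(oc get clusteroperator olm -o jsonpath='{.status.conditions[?(@.type=="Available")].status}')
$ since=$(oc get clusteroperator olm -o jsonpath='{.status.conditions[?(@.type=="Available")].lastTransitionTime}')
$ if [ "$status" = False ] && [ $(( $(date +%s) - $(date -d "$since" +%s) )) -gt 600 ]; then echo "olm has been Available=False for over 10 minutes; likely worth admin attention"; fi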
is cloned by / is depended on by: OCPBUGS-49622 "OLM should stay Available=True across brief component leader-lease hiccups" (ON_QA)