OpenShift Bugs / OCPBUGS-49622

OLM should stay Available=True across brief component leader-lease hiccups


    • Type: Bug
    • Resolution: Unresolved
    • Priority: Critical
    • Affects Version/s: 4.18, 4.19
    • Component/s: OLM
    • Sprint: Flareon OLM Sprint 266
    • Release Note Type: Bug Fix
    • Status: In Progress

      Description of problem

      Seen in CI runs like nightly-4.18-e2e-vsphere-ovn-serial/1881600762434293760:

      : [sig-arch][Early] Managed cluster should [apigroup:config.openshift.io] start all core operators [Skipped:Disconnected] [Suite:openshift/conformance/parallel]	2s
      {  fail [github.com/openshift/origin/test/extended/operators/operators.go:139]: Some cluster operators are not ready: olm (Available=False CatalogdDeploymentCatalogdControllerManager_Deploying: CatalogdDeploymentCatalogdControllerManagerAvailable: Waiting for Deployment)
      Ginkgo exit error 1: exit with code 1}
      

      This seems to have been a brief hiccup in catalogd's leader lease:

      $ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/logs/periodic-ci-openshift-release-master-nightly-4.18-e2e-vsphere-ovn-serial/1881600762434293760/artifacts/e2e-vsphere-ovn-serial/gather-extra/artifacts/events.json | jq -r '.items[] | select(.metadata.namespace == "openshift-catalogd") | .time = (.firstTimestamp // .eventTime) | .time + " " + (.count | tostring) + " " + (.involvedObject | .kind + " " + .namespace + " " + .name) + " " + .reason + ": " + .message' | sort
      2025-01-21T07:26:31Z 1 Pod openshift-catalogd catalogd-controller-manager-56d54f9c48-qnt7b Scheduled: Successfully assigned openshift-catalogd/catalogd-controller-manager-56d54f9c48-qnt7b to ci-op-f2vfirhs-40bbe-l4mch-master-1
      ...
      2025-01-21T07:37:37Z 1 Lease openshift-catalogd catalogd-operator-lock LeaderElection: catalogd-controller-manager-56d54f9c48-qnt7b_9a7ccb95-82b3-4293-9fde-d6ce5092cfec became leader
      2025-01-21T08:06:28Z 1 Lease openshift-catalogd catalogd-operator-lock LeaderElection: catalogd-controller-manager-56d54f9c48-qnt7b_319ca893-a268-4ea6-8cc8-982735bd03db became leader
      $ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/logs/periodic-ci-openshift-release-master-nightly-4.18-e2e-vsphere-ovn-serial/1881600762434293760/artifacts/e2e-vsphere-ovn-serial/gather-extra/artifacts/pods/openshift-catalogd_catalogd-controller-manager-56d54f9c48-qnt7b_manager_previous.log | tail -n6
      I0121 08:01:37.599302       1 clustercatalog_controller.go:86] "reconcile starting" logger="catalogd-controller" controller="clustercatalog" controllerGroup="olm.operatorframework.io" controllerKind="ClusterCatalog" ClusterCatalog="openshift-redhat-operators" namespace="" name="openshift-redhat-operators" reconcileID="25fcd127-e192-4f08-b11f-e598e375493c"
      I0121 08:01:37.599418       1 clustercatalog_controller.go:134] "reconcile ending" logger="catalogd-controller" controller="clustercatalog" controllerGroup="olm.operatorframework.io" controllerKind="ClusterCatalog" ClusterCatalog="openshift-redhat-operators" namespace="" name="openshift-redhat-operators" reconcileID="25fcd127-e192-4f08-b11f-e598e375493c"
      E0121 08:06:02.751106       1 leaderelection.go:429] Failed to update lock optimitically: Put "https://172.30.0.1:443/apis/coordination.k8s.io/v1/namespaces/openshift-catalogd/leases/catalogd-operator-lock": context deadline exceeded, falling back to slow path
      E0121 08:06:02.751185       1 leaderelection.go:436] error retrieving resource lock openshift-catalogd/catalogd-operator-lock: client rate limiter Wait returned an error: context deadline exceeded
      I0121 08:06:02.751200       1 leaderelection.go:297] failed to renew lease openshift-catalogd/catalogd-operator-lock: timed out waiting for the condition
      E0121 08:06:02.751276       1 main.go:344] "problem running manager" err="leader election lost" logger="setup"
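
      For context, the log lines above are controller-runtime's leader-election machinery: once the lease cannot be renewed, the manager exits with "leader election lost" and the pod restarts, which is what briefly drops the Deployment below its available threshold. As a rough sketch of where the relevant knobs live (a minimal example with illustrative values and wiring, not catalogd's actual configuration, and not necessarily the fix this bug is asking for):

      package main

      import (
          "time"

          ctrl "sigs.k8s.io/controller-runtime"
      )

      func main() {
          // Illustrative timings only, not catalogd's actual settings. Longer
          // lease/renew windows make a single slow apiserver round-trip less
          // likely to cost the lease outright.
          leaseDuration := 137 * time.Second
          renewDeadline := 107 * time.Second
          retryPeriod := 26 * time.Second

          mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{
              LeaderElection:          true,
              LeaderElectionID:        "catalogd-operator-lock",
              LeaderElectionNamespace: "openshift-catalogd",
              LeaseDuration:           &leaseDuration,
              RenewDeadline:           &renewDeadline,
              RetryPeriod:             &retryPeriod,
          })
          if err != nil {
              panic(err)
          }

          // If renewal still fails, Start returns "leader election lost" and the
          // process restarts, as in the previous.log tail above.
          if err := mgr.Start(ctrl.SetupSignalHandler()); err != nil {
              panic(err)
          }
      }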
      

      This bug is requesting that the ClusterOperator stay Available=True during this kind of brief hiccup, to conform with the documented semantics of the Available condition:

      Available=False means at least part of the component is non-functional, and that the condition requires immediate administrator intervention.

      and this run recovered quickly without requiring any admin intervention.
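
      For illustration, the kind of status handling that produces the failure above presumably looks something like the sketch below (hypothetical, not the actual code of whatever component sets the olm ClusterOperator status): the catalogd Deployment's availability is mirrored straight into the ClusterOperator, so any momentary dip immediately becomes Available=False.

      package status

      import (
          appsv1 "k8s.io/api/apps/v1"
          corev1 "k8s.io/api/core/v1"

          configv1 "github.com/openshift/api/config/v1"
      )

      // availableCondition is a sketch of straight mirroring: the ClusterOperator
      // condition tracks the Deployment's Available condition with no damping.
      func availableCondition(deploy *appsv1.Deployment) configv1.ClusterOperatorStatusCondition {
          for _, c := range deploy.Status.Conditions {
              if c.Type == appsv1.DeploymentAvailable && c.Status == corev1.ConditionTrue {
                  return configv1.ClusterOperatorStatusCondition{
                      Type:   configv1.OperatorAvailable,
                      Status: configv1.ConditionTrue,
                      Reason: "Deployed", // hypothetical reason string
                  }
              }
          }
          return configv1.ClusterOperatorStatusCondition{
              Type:    configv1.OperatorAvailable,
              Status:  configv1.ConditionFalse,
              Reason:  "Deploying", // compare the CatalogdDeployment..._Deploying reason above
              Message: "Waiting for Deployment",
          }
      }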

      Version-Release number of selected component

      CI Search turns up hits as far back as 4.14, but it is probably only worth fixing in 4.18 and later, where the component is GA (is this OLMv1?):

      $ curl -s 'https://search.dptools.openshift.org/search?search=Available.False.*Waiting+for+Deployment&maxAge=48h&type=junit&context=0' | jq -r 'keys[]' | grep -o '[-]4[.]1[0-9]' | sort | uniq -c | sort -n
            1 -4.14
            2 -4.16
            2 -4.17
            3 -4.15
           19 -4.18
           24 -4.19
      

      How reproducible

      Something like 1% to 36% impact, depending on the job flavor, based on:

      $ w3m -dump -cols 200 'https://search.dptools.openshift.org/?search=Available.False.*Waiting+for+Deployment&maxAge=48h&type=junit&context=0' | grep '[0-9][0-9] runs.*failures match'
      periodic-ci-openshift-release-master-ci-4.19-e2e-azure-ovn-upgrade (all) - 71 runs, 27% failed, 11% of failures match = 3% impact
      pull-ci-openshift-installer-main-e2e-vsphere-static-ovn (all) - 21 runs, 14% failed, 100% of failures match = 14% impact
      openshift-machine-config-operator-4735-ci-4.19-e2e-azure-ovn-upgrade (all) - 10 runs, 30% failed, 33% of failures match = 10% impact
      pull-ci-openshift-installer-main-e2e-vsphere-ovn-multi-network (all) - 21 runs, 24% failed, 20% of failures match = 5% impact
      periodic-ci-openshift-release-master-okd-scos-4.18-e2e-vsphere-ovn (all) - 11 runs, 82% failed, 44% of failures match = 36% impact
      pull-ci-openshift-installer-main-e2e-azure-ovn-resourcegroup (all) - 19 runs, 37% failed, 14% of failures match = 5% impact
      periodic-ci-openshift-release-master-ci-4.19-e2e-aws-upgrade-ovn-single-node (all) - 71 runs, 54% failed, 3% of failures match = 1% impact
      periodic-ci-openshift-release-master-nightly-4.18-e2e-vsphere-ovn (all) - 12 runs, 58% failed, 43% of failures match = 25% impact
      periodic-ci-openshift-release-master-ci-4.18-e2e-azure-ovn-upgrade (all) - 60 runs, 40% failed, 4% of failures match = 2% impact
      periodic-ci-openshift-release-master-ci-4.19-upgrade-from-stable-4.18-e2e-azure-ovn-upgrade (all) - 70 runs, 39% failed, 7% of failures match = 3% impact
      pull-ci-openshift-origin-master-e2e-aws-ovn-single-node (all) - 32 runs, 50% failed, 6% of failures match = 3% impact
      openshift-api-1997-ci-4.19-e2e-azure-ovn-upgrade (all) - 42 runs, 31% failed, 8% of failures match = 2% impact
      

      Steps to Reproduce

      1. Run lots of CI.
      2. Search the results for Available=False olm.

      Actual results

      Lots of search hits, see How reproducible.

      Expected results

      No matches.

      Additional info

      The end goal is to avoid alarming cluster admins with noise they don't need to look into, while continuing to alert them when something is actually stuck and likely needs admin intervention. Giving components at least a few minutes to see whether a concerning issue recovers doesn't add much delay to the admin response in situations where we don't automatically recover, and it avoids alarming folks in situations where we do. And you can be more nuanced than "give it a few minutes to see if we auto-recover" if you have a better grasp of the underlying issue than I do.
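
      As a sketch of that "give it a few minutes" approach (hypothetical helper and parameters, not existing OLM code), the status logic could keep reporting Available=True until the outage outlasts a grace period:

      package status

      import (
          "time"

          configv1 "github.com/openshift/api/config/v1"
      )

      // debouncedAvailable keeps Available=True until the component has been
      // unavailable for longer than gracePeriod, so brief leader-lease restarts
      // never surface to admins, while a genuinely stuck Deployment still does.
      func debouncedAvailable(deployAvailable bool, unavailableSince, now time.Time, gracePeriod time.Duration) configv1.ClusterOperatorStatusCondition {
          if deployAvailable || now.Sub(unavailableSince) < gracePeriod {
              // Either everything is fine, or the outage is still short enough
              // that it is likely to self-recover (as the run above did).
              return configv1.ClusterOperatorStatusCondition{
                  Type:   configv1.OperatorAvailable,
                  Status: configv1.ConditionTrue,
                  Reason: "Deployed", // hypothetical reason string
              }
          }
          // Only now is it worth asking an administrator to look.
          return configv1.ClusterOperatorStatusCondition{
              Type:    configv1.OperatorAvailable,
              Status:  configv1.ConditionFalse,
              Reason:  "Deploying",
              Message: "Deployment has been unavailable for more than " + gracePeriod.String(),
          }
      }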

              Lalatendu Mohanty (lmohanty@redhat.com)
              W. Trevor King (trking)
              Jian Zhang