OCPBUGS-25766: Cluster Baremetal operator should use a leader lock


    • Severity: Moderate
    • Release Note Text:
      * Previously, the Bare Metal Operator (BMO) did not use a leader lock to coordinate handover between incoming and outgoing operator pods. When a deployment rolled out a new operator pod, the new pod competed with the outgoing pod for shared resources, such as the `ClusterOperator` status, and the resulting conflicts caused operator pods to terminate and restart. This issue also affected clusters that did not include any bare-metal nodes. With this release, the BMO acquires a leader lock so that only one operator pod writes this shared state at a time, which resolves the competing pod issue. (link:https://issues.redhat.com/browse/OCPBUGS-25766[*OCPBUGS-25766*])
      (A minimal sketch of such a leader lock follows this panel.)
    • Release Note Type: Bug Fix
    • Resolution: Done
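
      For context on the release note above, this is roughly what enabling a leader lock looks like with controller-runtime. It is an illustrative sketch only, not the cluster-baremetal-operator's actual wiring; the lease ID and namespace below are assumptions.

      {code:go}
      // Illustrative only: enabling leader election on a controller-runtime manager.
      // The lease ID and namespace are assumed values, not necessarily what the
      // cluster-baremetal-operator ships.
      package main

      import (
          "os"

          ctrl "sigs.k8s.io/controller-runtime"
      )

      func main() {
          mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{
              LeaderElection:          true,
              LeaderElectionID:        "cluster-baremetal-operator-lock", // assumed lease name
              LeaderElectionNamespace: "openshift-machine-api",
          })
          if err != nil {
              os.Exit(1)
          }

          // Controller and webhook registration would go here.

          // Start only runs the leader-election runnables (the controllers) once
          // this pod holds the lease, so an incoming pod waits instead of racing
          // the outgoing pod over shared objects like the ClusterOperator status.
          if err := mgr.Start(ctrl.SetupSignalHandler()); err != nil {
              os.Exit(1)
          }
      }
      {code}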

      Description of problem:

      Seen in this 4.15 to 4.16 CI run:

      {code:none}
      : [sig-cluster-lifecycle] pathological event should not see excessive Back-off restarting failed containers	0s
      {  event [namespace/openshift-machine-api node/ip-10-0-62-147.us-west-2.compute.internal pod/cluster-baremetal-operator-574577fbcb-z8nd4 hmsg/bf39bb17ae - Back-off restarting failed container cluster-baremetal-operator in pod cluster-baremetal-operator-574577fbcb-z8nd4_openshift-machine-api(441969c1-b430-412c-b67f-4ae2f7797f4f)] happened 26 times
      event [namespace/openshift-machine-api node/ip-10-0-62-147.us-west-2.compute.internal pod/cluster-baremetal-operator-574577fbcb-z8nd4 hmsg/bf39bb17ae - Back-off restarting failed container cluster-baremetal-operator in pod cluster-baremetal-operator-574577fbcb-z8nd4_openshift-machine-api(441969c1-b430-412c-b67f-4ae2f7797f4f)] happened 51 times}
      {code}
      

      The operator recovered, and the update completed, but it's still probably worth cleaning up whatever's happening to avoid alarming anyone.

      Version-Release number of selected component (if applicable):

      Seems like all recent CI runs that match this string touch 4.15, 4.16, or development branches:

      {code:none}
      $ w3m -dump -cols 200 'https://search.ci.openshift.org/?maxAge=24h&type=junit&search=Back-off+restarting+failed+container+cluster-baremetal-operator+in+pod+cluster-baremetal-operator' | grep 'failures match'
      pull-ci-openshift-ovn-kubernetes-master-e2e-aws-ovn-upgrade-local-gateway (all) - 11 runs, 36% failed, 25% of failures match = 9% impact
      periodic-ci-openshift-multiarch-master-nightly-4.16-ocp-e2e-upgrade-azure-ovn-heterogeneous (all) - 15 runs, 20% failed, 33% of failures match = 7% impact
      pull-ci-openshift-kubernetes-master-e2e-aws-ovn-downgrade (all) - 3 runs, 67% failed, 50% of failures match = 33% impact
      periodic-ci-openshift-multiarch-master-nightly-4.16-ocp-e2e-aws-ovn-heterogeneous-upgrade (all) - 15 runs, 27% failed, 25% of failures match = 7% impact
      periodic-ci-openshift-release-master-ci-4.16-upgrade-from-stable-4.15-e2e-azure-sdn-upgrade (all) - 32 runs, 91% failed, 7% of failures match = 6% impact
      periodic-ci-openshift-release-master-ci-4.15-upgrade-from-stable-4.14-e2e-aws-ovn-upgrade (all) - 40 runs, 25% failed, 20% of failures match = 5% impact
      periodic-ci-openshift-release-master-nightly-4.15-upgrade-from-stable-4.14-e2e-aws-sdn-upgrade (all) - 3 runs, 33% failed, 100% of failures match = 33% impact
      pull-ci-openshift-cluster-version-operator-master-e2e-agnostic-ovn-upgrade-out-of-change (all) - 4 runs, 25% failed, 100% of failures match = 25% impact
      periodic-ci-openshift-release-master-ci-4.16-upgrade-from-stable-4.15-e2e-aws-ovn-upgrade (all) - 40 runs, 8% failed, 33% of failures match = 3% impact
      pull-ci-openshift-azure-file-csi-driver-operator-main-e2e-azure-ovn-upgrade (all) - 2 runs, 100% failed, 50% of failures match = 50% impact
      periodic-ci-openshift-release-master-okd-4.15-e2e-aws-ovn-upgrade (all) - 7 runs, 43% failed, 33% of failures match = 14% impact
      pull-ci-openshift-origin-master-e2e-aws-ovn-upgrade (all) - 10 runs, 30% failed, 33% of failures match = 10% impact
      periodic-ci-openshift-multiarch-master-nightly-4.15-upgrade-from-stable-4.14-ocp-e2e-upgrade-gcp-ovn-arm64 (all) - 6 runs, 33% failed, 50% of failures match = 17% impact
      periodic-ci-openshift-multiarch-master-nightly-4.15-upgrade-from-stable-4.14-ocp-e2e-aws-ovn-heterogeneous-upgrade (all) - 11 runs, 18% failed, 50% of failures match = 9% impact
      {code}
      

      How reproducible:
      
      Looks like ~8% impact.
      
      Steps to Reproduce:
      
      1. Run ~20 of the exposed job types.
      2. Check for {{: [sig-cluster-lifecycle] pathological event should not see excessive Back-off restarting failed containers}} failures with {{Back-off restarting failed container cluster-baremetal-operator}} messages.
      
      Actual results:
      
      ~8% impact.
      
      Expected results:
      
      ~0% impact.
      
      Additional info:
      
      Dropping into Loki for the run I'd picked:
      
      {code:none}
      {invoker="openshift-internal-ci/periodic-ci-openshift-release-master-ci-4.16-upgrade-from-stable-4.15-e2e-aws-ovn-upgrade/1737335551998038016"} | unpack | pod="cluster-baremetal-operator-574577fbcb-z8nd4" container="cluster-baremetal-operator" |~ "220 06:0"
      {code}
      

      includes:

      {code:none}
      E1220 06:04:18.794548       1 main.go:131] "unable to create controller" err="unable to put \"baremetal\" ClusterOperator in Available state: Operation cannot be fulfilled on clusteroperators.config.openshift.io \"baremetal\": the object has been modified; please apply your changes to the latest version and try again" controller="Provisioning"
      I1220 06:05:40.753364       1 listener.go:44] controller-runtime/metrics "msg"="Metrics server is starting to listen" "addr"=":8080"
      I1220 06:05:40.766200       1 webhook.go:104] WebhookDependenciesReady: everything ready for webhooks
      I1220 06:05:40.780426       1 clusteroperator.go:217] "new CO status" reason="WaitingForProvisioningCR" processMessage="" message="Waiting for Provisioning CR on BareMetal Platform"
      E1220 06:05:40.795555       1 main.go:131] "unable to create controller" err="unable to put \"baremetal\" ClusterOperator in Available state: Operation cannot be fulfilled on clusteroperators.config.openshift.io \"baremetal\": the object has been modified; please apply your changes to the latest version and try again" controller="Provisioning"
      I1220 06:08:21.730591       1 listener.go:44] controller-runtime/metrics "msg"="Metrics server is starting to listen" "addr"=":8080"
      I1220 06:08:21.747466       1 webhook.go:104] WebhookDependenciesReady: everything ready for webhooks
      I1220 06:08:21.768138       1 clusteroperator.go:217] "new CO status" reason="WaitingForProvisioningCR" processMessage="" message="Waiting for Provisioning CR on BareMetal Platform"
      E1220 06:08:21.781058       1 main.go:131] "unable to create controller" err="unable to put \"baremetal\" ClusterOperator in Available state: Operation cannot be fulfilled on clusteroperators.config.openshift.io \"baremetal\": the object has been modified; please apply your changes to the latest version and try again" controller="Provisioning"
      {code}
      

      So some kind of ClusterOperator-modification race? The incoming operator pod's attempt to set the {{baremetal}} ClusterOperator to Available hits a resourceVersion conflict (presumably against writes from the outgoing pod), the error is treated as fatal in {{main.go}}, and the container exits and lands in the back-off restart loop reported above.
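
      For illustration only (this is not the operator's actual status-update code): the generic client-go way to survive that "the object has been modified" conflict is to re-fetch the latest object and retry the write, for example with {{retry.RetryOnConflict}}. A minimal sketch, with an assumed helper name {{setAvailable}}:

      {code:go}
      // Illustrative only: the standard client-go retry-on-conflict pattern for
      // "the object has been modified" errors. Not the cluster-baremetal-operator's
      // actual code; the package and helper names are assumed.
      package clusteroperator

      import (
          "context"

          configv1 "github.com/openshift/api/config/v1"
          "k8s.io/client-go/util/retry"
          "sigs.k8s.io/controller-runtime/pkg/client"
      )

      // setAvailable rewrites the "baremetal" ClusterOperator status conditions,
      // re-fetching and retrying whenever another writer (for example the outgoing
      // operator pod) has bumped the resourceVersion in the meantime.
      func setAvailable(ctx context.Context, c client.Client, conds []configv1.ClusterOperatorStatusCondition) error {
          return retry.RetryOnConflict(retry.DefaultRetry, func() error {
              co := &configv1.ClusterOperator{}
              if err := c.Get(ctx, client.ObjectKey{Name: "baremetal"}, co); err != nil {
                  return err
              }
              co.Status.Conditions = conds
              return c.Status().Update(ctx, co)
          })
      }
      {code}

      Per the release note above, the shipped fix went with the leader lock rather than a retry, keeping a second writer from running at all.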
