
metal3-baremetal-operator gets scaled down to 0 if the node where it runs dies


      Description of problem:

      OCP 4.14.34 BareMetal IPI IPv6 single-stack cluster
      
      The metal3-baremetal-operator deployment gets scaled down to 0 replicas when the node hosting the metal3-baremetal-operator pod dies. The sequence is:
      1. The node where the metal3-baremetal-operator-xxxxxxxxxx-yyyyy pod is running dies.
      2. The cluster creates a new pod, metal3-baremetal-operator-xxxxxxxxxx-zzzzz, to maintain the single desired replica.
      3. The cluster marks the old metal3-baremetal-operator-xxxxxxxxxx-yyyyy pod for deletion.
      4. Right after that, the metal3-baremetal-operator deployment is automatically scaled down to 0.
      5. The new metal3-baremetal-operator-xxxxxxxxxx-zzzzz pod gets deleted.
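      While the node is going down, this sequence can be watched live. A minimal sketch, assuming cluster-admin access and the default openshift-machine-api namespace (grep on the pod name is used instead of a label selector to avoid guessing pod labels):
      
      # Watch the deployment's replica count change in real time
      $ oc get deployment metal3-baremetal-operator -n openshift-machine-api -w
      
      # Watch the operator pods being created and deleted, including which node they land on
      $ oc get pods -n openshift-machine-api -o wide -w | grep metal3-baremetal-operator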
      
      Here are the relevant events collected in the openshift-machine-api namespace:
      2024-09-20T18:20:05Z   Warning   NodeNotReady              Pod             metal3-baremetal-operator-xxxxxxxxxx-yyyyy           Node is not ready
      2024-09-20T18:22:11Z   Normal    Scheduled                 Pod             metal3-baremetal-operator-xxxxxxxxxx-zzzzz           Successfully assigned openshift-machine-api/metal3-baremetal-operator-xxxxxxxxxx-zzzzz to master-1
      2024-09-20T18:22:11Z   Normal    SuccessfulCreate          ReplicaSet      metal3-baremetal-operator-xxxxxxxxxx                 Created pod: metal3-baremetal-operator-xxxxxxxxxx-zzzzz
      2024-09-20T18:22:11Z   Normal    TaintManagerEviction      Pod             metal3-baremetal-operator-xxxxxxxxxx-yyyyy           Marking for deletion Pod openshift-machine-api/metal3-baremetal-operator-xxxxxxxxxx-yyyyy
      2024-09-20T18:22:12Z   Normal    ScalingReplicaSet         Deployment      metal3-baremetal-operator                            Scaled down replica set metal3-baremetal-operator-xxxxxxxxxx to 0 from 1
      2024-09-20T18:22:12Z   Normal    SuccessfulDelete          ReplicaSet      metal3-baremetal-operator-xxxxxxxxxx                 Deleted pod: metal3-baremetal-operator-xxxxxxxxxx-zzzzz
      2024-09-20T18:22:14Z   Normal    AddedInterface            Pod             metal3-baremetal-operator-xxxxxxxxxx-zzzzz           Add eth0 [fd01:2:0:2::22/64] from ovn-kubernetes
      2024-09-20T18:22:14Z   Normal    Pulling                   Pod             metal3-baremetal-operator-xxxxxxxxxx-zzzzz           Pulling image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:559fd1b360a1e79a366564507f24c7e0002d02a44e5447759590734a58dbe214"
      2024-09-20T18:22:20Z   Normal    Created                   Pod             metal3-baremetal-operator-xxxxxxxxxx-zzzzz           Created container metal3-baremetal-operator
      2024-09-20T18:22:20Z   Normal    Pulled                    Pod             metal3-baremetal-operator-xxxxxxxxxx-zzzzz           Successfully pulled image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:559fd1b360a1e79a366564507f24c7e0002d02a44e5447759590734a58dbe214" in 6.186656696s (6.18666779s including waiting)
      2024-09-20T18:22:20Z   Normal    Started                   Pod             metal3-baremetal-operator-xxxxxxxxxx-zzzzz           Started container metal3-baremetal-operator
      2024-09-20T18:22:21Z   Normal    Killing                   Pod             metal3-baremetal-operator-xxxxxxxxxx-zzzzz           Stopping container metal3-baremetal-operator
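
      To narrow down what issued the scale-down, one approach (a diagnostic sketch, assuming the cluster-baremetal-operator deployment runs under its default name in openshift-machine-api) is to check which field manager last set .spec.replicas and to review the cluster-baremetal-operator logs around 18:22:12Z:
      
      # managedFields records which client last modified .spec.replicas on the deployment
      $ oc get deployment metal3-baremetal-operator -n openshift-machine-api -o yaml --show-managed-fields
      
      # Replica history and events for the deployment
      $ oc describe deployment metal3-baremetal-operator -n openshift-machine-api
      
      # cluster-baremetal-operator logs around the time of the scale-down
      $ oc logs -n openshift-machine-api deployment/cluster-baremetal-operator --since=1h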

      Version-Release number of selected component (if applicable):

          OCP 4.14.34

      How reproducible:

      I wasn't able to reproduce this problem in an IPv4 lab with the same OCP version; however, the customer was able to reproduce the same issue on another IPv6 cluster.

      Steps to Reproduce:

          1. Shut down the node where the metal3-baremetal-operator pod is running.
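
      One way to simulate the node death in a lab (a sketch, assuming an ungraceful poweroff from the node itself is close enough to a real node failure; <node-name> is a placeholder):
      
      # Find the node currently hosting the operator pod
      $ oc get pods -n openshift-machine-api -o wide | grep metal3-baremetal-operator
      
      # Power that node off ungracefully (or cut power from the BMC), then watch the namespace events
      $ oc debug node/<node-name> -- chroot /host systemctl poweroff --force
      $ oc get events -n openshift-machine-api -w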

      Actual results:

      The old metal3-baremetal-operator pod remains in Terminating state. The new pod is briefly created, but it is deleted right away because the related deployment is scaled down to 0.

      Expected results:

      The metal3-baremetal-operator pod should be rescheduled on another node, the old pod should be terminated, and the new pod should be able to power the shut-down node back on.
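
      A quick way to verify that recovery (a sketch, assuming default resource names; the BareMetalHost power state reflects what the operator would act on):
      
      # The replacement pod should end up Running on a surviving node
      $ oc get pods -n openshift-machine-api -o wide | grep metal3-baremetal-operator
      
      # The BareMetalHost backing the failed node should be powered back on
      $ oc get bmh -n openshift-machine-api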

      Additional info:

      Installed operators:
      $ oc get subs -A
      NAMESPACE                 NAME                               PACKAGE                            SOURCE                   CHANNEL
      multicluster-engine       multicluster-engine                multicluster-engine                webscale-acm-operators   stable-2.5
      open-cluster-management   acm-operator-subscription          advanced-cluster-management        webscale-acm-operators   release-2.10
      openshift-logging         cluster-logging                    cluster-logging                    webscale-acm-operators   stable
      openshift-logging         elasticsearch-operator             elasticsearch-operator             webscale-acm-operators   stable
      openshift-nmstate         kubernetes-nmstate-operator        kubernetes-nmstate-operator        webscale-acm-operators   stable
      openshift-operators       openshift-gitops-operator          openshift-gitops-operator          webscale-acm-operators   gitops-1.10
      openshift-operators       topology-aware-lifecycle-manager   topology-aware-lifecycle-manager   webscale-acm-operators   stable
