- Bug
- Resolution: Unresolved
- Normal
- None
- 4.14
- Moderate
- None
- Metal Platform 262, Metal Platform 263, Metal Platform 264, Metal Platform 266
- 4
- False
Description of problem:
OCP 4.14.34 BareMetal IPI, IPv6 single-stack cluster.

The metal3-baremetal-operator deployment gets scaled down to 0 replicas when the node where the metal3-baremetal-operator pod is running dies. Here is what happens:

1. The node where the metal3-baremetal-operator-xxxxxxxxxx-yyyyy pod is running dies.
2. The cluster creates a new pod, metal3-baremetal-operator-xxxxxxxxxx-zzzzz, to maintain the one required replica.
3. The cluster marks the old metal3-baremetal-operator-xxxxxxxxxx-yyyyy pod for deletion.
4. Right after that, the metal3-baremetal-operator deployment is automatically scaled down to 0.
5. The new metal3-baremetal-operator-xxxxxxxxxx-zzzzz pod gets deleted.

Here are the relevant events collected in the openshift-machine-api namespace:

2024-09-20T18:20:05Z Warning NodeNotReady Pod metal3-baremetal-operator-xxxxxxxxxx-yyyyy Node is not ready
2024-09-20T18:22:11Z Normal Scheduled Pod metal3-baremetal-operator-xxxxxxxxxx-zzzzz Successfully assigned openshift-machine-api/metal3-baremetal-operator-xxxxxxxxxx-zzzzz to master-1
2024-09-20T18:22:11Z Normal SuccessfulCreate ReplicaSet metal3-baremetal-operator-xxxxxxxxxx Created pod: metal3-baremetal-operator-xxxxxxxxxx-zzzzz
2024-09-20T18:22:11Z Normal TaintManagerEviction Pod metal3-baremetal-operator-xxxxxxxxxx-yyyyy Marking for deletion Pod openshift-machine-api/metal3-baremetal-operator-xxxxxxxxxx-yyyyy
2024-09-20T18:22:12Z Normal ScalingReplicaSet Deployment metal3-baremetal-operator Scaled down replica set metal3-baremetal-operator-xxxxxxxxxx to 0 from 1
2024-09-20T18:22:12Z Normal SuccessfulDelete ReplicaSet metal3-baremetal-operator-xxxxxxxxxx Deleted pod: metal3-baremetal-operator-xxxxxxxxxx-zzzzz
2024-09-20T18:22:14Z Normal AddedInterface Pod metal3-baremetal-operator-xxxxxxxxxx-zzzzz Add eth0 [fd01:2:0:2::22/64] from ovn-kubernetes
2024-09-20T18:22:14Z Normal Pulling Pod metal3-baremetal-operator-xxxxxxxxxx-zzzzz Pulling image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:559fd1b360a1e79a366564507f24c7e0002d02a44e5447759590734a58dbe214"
2024-09-20T18:22:20Z Normal Created Pod metal3-baremetal-operator-xxxxxxxxxx-zzzzz Created container metal3-baremetal-operator
2024-09-20T18:22:20Z Normal Pulled Pod metal3-baremetal-operator-xxxxxxxxxx-zzzzz Successfully pulled image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:559fd1b360a1e79a366564507f24c7e0002d02a44e5447759590734a58dbe214" in 6.186656696s (6.18666779s including waiting)
2024-09-20T18:22:20Z Normal Started Pod metal3-baremetal-operator-xxxxxxxxxx-zzzzz Started container metal3-baremetal-operator
2024-09-20T18:22:21Z Normal Killing Pod metal3-baremetal-operator-xxxxxxxxxx-zzzzz Stopping container metal3-baremetal-operator
Version-Release number of selected component (if applicable):
How reproducible:
I was not able to reproduce this problem in an IPv4 lab with the same OCP version. The customer was able to reproduce the same issue on another IPv6 cluster.
Steps to Reproduce:
1. Shut down the node where the metal3-baremetal-operator pod is running.
Actual results:
The old metal3-baremetal-operator pod remains in Terminating state. The new pod briefly starts, but it is deleted right afterwards because the related deployment is scaled down to 0.
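
To put numbers on "right afterwards", the intervals between the events quoted in the description can be computed with a short script. This is only a rough sketch for triage: it assumes the events are available one per line in the "timestamp type reason kind name message" layout used in this report, and the helper names parse_event and seconds_between are made up for illustration. Only a subset of the events is included.

```python
from datetime import datetime

# Subset of the events quoted in the description, one event per line.
EVENT_LOG = """\
2024-09-20T18:20:05Z Warning NodeNotReady Pod metal3-baremetal-operator-xxxxxxxxxx-yyyyy Node is not ready
2024-09-20T18:22:11Z Normal SuccessfulCreate ReplicaSet metal3-baremetal-operator-xxxxxxxxxx Created pod: metal3-baremetal-operator-xxxxxxxxxx-zzzzz
2024-09-20T18:22:11Z Normal TaintManagerEviction Pod metal3-baremetal-operator-xxxxxxxxxx-yyyyy Marking for deletion Pod openshift-machine-api/metal3-baremetal-operator-xxxxxxxxxx-yyyyy
2024-09-20T18:22:12Z Normal ScalingReplicaSet Deployment metal3-baremetal-operator Scaled down replica set metal3-baremetal-operator-xxxxxxxxxx to 0 from 1
2024-09-20T18:22:12Z Normal SuccessfulDelete ReplicaSet metal3-baremetal-operator-xxxxxxxxxx Deleted pod: metal3-baremetal-operator-xxxxxxxxxx-zzzzz
"""


def parse_event(line):
    """Split one event line into its fields (hypothetical helper)."""
    timestamp, event_type, reason, kind, name, message = line.split(" ", 5)
    return {
        "time": datetime.strptime(timestamp, "%Y-%m-%dT%H:%M:%SZ"),
        "type": event_type,
        "reason": reason,
        "kind": kind,
        "name": name,
        "message": message,
    }


def seconds_between(events, first_reason, second_reason):
    """Seconds from the first event with first_reason to the first with second_reason."""
    first = next(e for e in events if e["reason"] == first_reason)
    second = next(e for e in events if e["reason"] == second_reason)
    return (second["time"] - first["time"]).total_seconds()


events = [parse_event(line) for line in EVENT_LOG.splitlines() if line.strip()]

# Time from the node going NotReady to the replacement pod being created.
print(seconds_between(events, "NodeNotReady", "SuccessfulCreate"))
# Time from the replacement pod being created to the deployment scale-down.
print(seconds_between(events, "SuccessfulCreate", "ScalingReplicaSet"))
```

With the timestamps above, the replacement pod is created 126 seconds after the NodeNotReady event, and the scale-down to 0 follows 1 second after that.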
Expected results:
The metal3-baremetal-operator pod should respawn on another node, the old pod should be terminated, and the new pod should be able to power on the node that was shut down.
Additional info:
Installed operators:

$ oc get subs -A
NAMESPACE                NAME                              PACKAGE                           SOURCE                  CHANNEL
multicluster-engine      multicluster-engine               multicluster-engine               webscale-acm-operators  stable-2.5
open-cluster-management  acm-operator-subscription         advanced-cluster-management       webscale-acm-operators  release-2.10
openshift-logging        cluster-logging                   cluster-logging                   webscale-acm-operators  stable
openshift-logging        elasticsearch-operator            elasticsearch-operator            webscale-acm-operators  stable
openshift-nmstate        kubernetes-nmstate-operator       kubernetes-nmstate-operator       webscale-acm-operators  stable
openshift-operators      openshift-gitops-operator         openshift-gitops-operator         webscale-acm-operators  gitops-1.10
openshift-operators      topology-aware-lifecycle-manager  topology-aware-lifecycle-manager  webscale-acm-operators  stable