
metal3-baremetal-operator gets scaled down to 0 if the node where it runs dies


      Description of problem:

      OCP 4.14.34 BareMetal IPI IPv6 single-stack cluster
      
      The metal3-baremetal-operator deployment gets scaled down to 0 replicas when the node hosting the metal3-baremetal-operator pod dies. The sequence is:
      1. The node where the metal3-baremetal-operator-xxxxxxxxxx-yyyyy pod is running dies.
      2. The cluster creates a new pod, metal3-baremetal-operator-xxxxxxxxxx-zzzzz, to maintain the single desired replica.
      3. The cluster marks the old metal3-baremetal-operator-xxxxxxxxxx-yyyyy pod for deletion.
      4. Right after that, the metal3-baremetal-operator deployment is automatically scaled down to 0.
      5. The new metal3-baremetal-operator-xxxxxxxxxx-zzzzz pod gets deleted.
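      While the node is going down, this sequence can be watched live. A minimal sketch, assuming cluster-admin access and the default openshift-machine-api namespace (grep on the pod name is used instead of a label selector to avoid guessing pod labels):
      
      # Watch the deployment's replica count change in real time
      $ oc get deployment metal3-baremetal-operator -n openshift-machine-api -w
      
      # Watch the operator pods being created and deleted, including which node they land on
      $ oc get pods -n openshift-machine-api -o wide -w | grep metal3-baremetal-operator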
      
      Here are the relevant events collected in the openshift-machine-api namespace:
      2024-09-20T18:20:05Z   Warning   NodeNotReady              Pod             metal3-baremetal-operator-xxxxxxxxxx-yyyyy           Node is not ready
      2024-09-20T18:22:11Z   Normal    Scheduled                 Pod             metal3-baremetal-operator-xxxxxxxxxx-zzzzz           Successfully assigned openshift-machine-api/metal3-baremetal-operator-xxxxxxxxxx-zzzzz to master-1
      2024-09-20T18:22:11Z   Normal    SuccessfulCreate          ReplicaSet      metal3-baremetal-operator-xxxxxxxxxx                 Created pod: metal3-baremetal-operator-xxxxxxxxxx-zzzzz
      2024-09-20T18:22:11Z   Normal    TaintManagerEviction      Pod             metal3-baremetal-operator-xxxxxxxxxx-yyyyy           Marking for deletion Pod openshift-machine-api/metal3-baremetal-operator-xxxxxxxxxx-yyyyy
      2024-09-20T18:22:12Z   Normal    ScalingReplicaSet         Deployment      metal3-baremetal-operator                            Scaled down replica set metal3-baremetal-operator-xxxxxxxxxx to 0 from 1
      2024-09-20T18:22:12Z   Normal    SuccessfulDelete          ReplicaSet      metal3-baremetal-operator-xxxxxxxxxx                 Deleted pod: metal3-baremetal-operator-xxxxxxxxxx-zzzzz
      2024-09-20T18:22:14Z   Normal    AddedInterface            Pod             metal3-baremetal-operator-xxxxxxxxxx-zzzzz           Add eth0 [fd01:2:0:2::22/64] from ovn-kubernetes
      2024-09-20T18:22:14Z   Normal    Pulling                   Pod             metal3-baremetal-operator-xxxxxxxxxx-zzzzz           Pulling image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:559fd1b360a1e79a366564507f24c7e0002d02a44e5447759590734a58dbe214"
      2024-09-20T18:22:20Z   Normal    Created                   Pod             metal3-baremetal-operator-xxxxxxxxxx-zzzzz           Created container metal3-baremetal-operator
      2024-09-20T18:22:20Z   Normal    Pulled                    Pod             metal3-baremetal-operator-xxxxxxxxxx-zzzzz           Successfully pulled image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:559fd1b360a1e79a366564507f24c7e0002d02a44e5447759590734a58dbe214" in 6.186656696s (6.18666779s including waiting)
      2024-09-20T18:22:20Z   Normal    Started                   Pod             metal3-baremetal-operator-xxxxxxxxxx-zzzzz           Started container metal3-baremetal-operator
      2024-09-20T18:22:21Z   Normal    Killing                   Pod             metal3-baremetal-operator-xxxxxxxxxx-zzzzz           Stopping container metal3-baremetal-operator
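
      To narrow down what issued the scale-down, one approach (a diagnostic sketch, assuming the cluster-baremetal-operator deployment runs under its default name in openshift-machine-api) is to check which field manager last set .spec.replicas and to review the cluster-baremetal-operator logs around 18:22:12Z:
      
      # managedFields records which client last modified .spec.replicas on the deployment
      $ oc get deployment metal3-baremetal-operator -n openshift-machine-api -o yaml --show-managed-fields
      
      # Replica history and events for the deployment
      $ oc describe deployment metal3-baremetal-operator -n openshift-machine-api
      
      # cluster-baremetal-operator logs around the time of the scale-down
      $ oc logs -n openshift-machine-api deployment/cluster-baremetal-operator --since=1h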

      Version-Release number of selected component (if applicable):

          OCP 4.14.34

      How reproducible:

      I wasn't able to reproduce this problem in an IPv4 lab with the same OCP version; however, the customer was able to reproduce the same issue on another IPv6 cluster.

      Steps to Reproduce:

          1. Shut down the node where the metal3-baremetal-operator pod is running.
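
      One way to simulate the node death in a lab (a sketch, assuming an ungraceful poweroff from the node itself is close enough to a real node failure; <node-name> is a placeholder):
      
      # Find the node currently hosting the operator pod
      $ oc get pods -n openshift-machine-api -o wide | grep metal3-baremetal-operator
      
      # Power that node off ungracefully (or cut power from the BMC), then watch the namespace events
      $ oc debug node/<node-name> -- chroot /host systemctl poweroff --force
      $ oc get events -n openshift-machine-api -w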

      Actual results:

      The old metal3-baremetal-operator pod remains in Terminating state. The new pod is briefly created, but it is deleted right away because the related deployment is scaled down to 0.

      Expected results:

      The metal3-baremetal-operator pod should be rescheduled on another node, the old pod should be terminated, and the new pod should be able to power the shut-down node back on.
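
      A quick way to verify that recovery (a sketch, assuming default resource names; the BareMetalHost power state reflects what the operator would act on):
      
      # The replacement pod should end up Running on a surviving node
      $ oc get pods -n openshift-machine-api -o wide | grep metal3-baremetal-operator
      
      # The BareMetalHost backing the failed node should be powered back on
      $ oc get bmh -n openshift-machine-api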

      Additional info:

      Installed operators:
      $ oc get subs -A
      NAMESPACE                 NAME                               PACKAGE                            SOURCE                   CHANNEL
      multicluster-engine       multicluster-engine                multicluster-engine                webscale-acm-operators   stable-2.5
      open-cluster-management   acm-operator-subscription          advanced-cluster-management        webscale-acm-operators   release-2.10
      openshift-logging         cluster-logging                    cluster-logging                    webscale-acm-operators   stable
      openshift-logging         elasticsearch-operator             elasticsearch-operator             webscale-acm-operators   stable
      openshift-nmstate         kubernetes-nmstate-operator        kubernetes-nmstate-operator        webscale-acm-operators   stable
      openshift-operators       openshift-gitops-operator          openshift-gitops-operator          webscale-acm-operators   gitops-1.10
      openshift-operators       topology-aware-lifecycle-manager   topology-aware-lifecycle-manager   webscale-acm-operators   stable
