Type: Bug
Resolution: Can't Do
Priority: Major
Fix Version/s: None
Affects Version/s: 4.18.z
Component/s: Bare Metal Hardware Provisioning / baremetal-operator
Labels:

Activity Type:
Quality / Stability / Reliability
Blocked:
False
Blocked Reason:

None
Story Points:
2
Severity:
None
Regression:
None

Target Backport Versions:

4.18.z, 4.19.z
Target Version:

4.20.0
Release Blocker:
None
Sprint:
Metal Platform 274, Metal Platform 278
sprint_count:
2

SFDC Cases Counter:
SFDC Cases Open:
SFDC Cases Links:

PX Impact Score:

Release Note Status:
None
Release Note Type:
None
Release Note Text:
None

Escape Reason:
None
Escape Impact:
None
Corrective Measures:
None
SDLC stage when should've been found:
None

Description of problem:

The customer reported that a BareMetalHost can only be provisioned when both metal3 pods are running on the same node.

--- oc get pods -A -owide | grep metal3 (cluster-prod-edge-spoke1-dc-sin2) ---
openshift-machine-api                              metal3-66f78c98bb-4gtqn                                                     4/4     Running                  0             31h     10.68.2.1      control01.itup-002.example.com   <none>           <none>
openshift-machine-api                              metal3-baremetal-operator-9dc676f77-gn75t                                   1/1     Running                  0             31h     172.16.0.72    control01.itup-002.example.com   <none>           <none>
-------------------------------------------
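To make the same-node check repeatable, the listing above can be summarized with a small helper. This is only a sketch: `metal3_nodes` is a hypothetical name, it assumes `awk` is available, and it assumes the column layout shown above, where the node name is the eighth field of `oc get pods -A -o wide`.

```shell
# Reads `oc get pods -A -o wide` output on stdin, prints every pod whose name
# starts with "metal3-" together with its node, then prints a summary line.
# Assumption: RESTARTS is a bare number; a value such as "1 (5m ago)" shifts
# the columns and would make field 8 something other than the node name.
metal3_nodes() {
  awk '$2 ~ /^metal3-/ { print $2, $8; nodes[$8] = 1 }
       END {
         n = 0
         for (k in nodes) n++
         if (n == 1) print "same node"
         else print "nodes in use: " n
       }'
}

# Example:
#   oc get pods -A -o wide | metal3_nodes
```

Any other pod whose name begins with `metal3-` is counted as well, so the summary line is only a hint and the per-pod lines should still be read.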

From the infrastructure-operator pod, connections to the metal3-state service on port 6388/tcp time out. Because of this, the customer is unable to deploy a new spoke cluster. To fix the issue, a custom rule had to be added to the security group attached to the master nodes.

---ERROR ---
{"level":"info","ts":1749121148.4973779,"logger":"provisioner.ironic","msg":"error caught while checking endpoint, will retry","host":"example~control02","endpoint":"https://metal3-state.openshift-machine-api.svc.cluster.local:6388/v1/","error":"Get \"https://metal3-state.openshift-machine-api.svc.cluster.local:6388/v1/\": dial tcp 172.30.34.248:6388: i/o timeout"}
---


--- BEFORE ADDING SG RULE ---
NAME                                   TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)                      AGE
metal3-state                           ClusterIP   172.30.73.28     <none>        6388/TCP,6180/TCP,6183/TCP   18h


bash-5.1 ~ $ oc rsh infrastructure-operator-6c5698fffb-v8lmg 
sh-5.1$ curl https://172.30.73.28:6388                       
curl: (28) Failed to connect to 172.30.73.28 port 6388: Connection timed out
-----------------------------

--- AFTER ADDING SG RULE ---
sh-5.1$ curl https://172.30.73.28:6388 -kv
*   Trying 172.30.73.28:6388...
* Connected to 172.30.73.28 (172.30.73.28) port 6388 (#0)
.
.
.
----------------------------
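The curl check above covers only port 6388, while the service listing also shows 6180 and 6183. A minimal sketch for probing all three from inside a pod follows. It assumes `bash` and the coreutils `timeout` command exist in the pod image, which has not been verified here. `probe_port` and `probe_metal3_state` are hypothetical helper names, and the probe only tests TCP reachability, not TLS or the Ironic API itself.

```shell
# Probe one TCP port using bash's built-in /dev/tcp redirection.
# Prints "<host>:<port> open" or "<host>:<port> unreachable"; returns 0 or 1.
probe_port() {
  local host="$1" port="$2" wait_s="${3:-5}"
  # The child bash receives host as $0 and port as $1.
  if timeout "$wait_s" bash -c 'exec 3<>"/dev/tcp/$0/$1"' "$host" "$port" 2>/dev/null; then
    echo "${host}:${port} open"
  else
    echo "${host}:${port} unreachable"
    return 1
  fi
}

# Probe every port listed for the metal3-state service (6388, 6180, 6183).
# Returns non-zero if at least one port cannot be reached.
probe_metal3_state() {
  local host="$1" port rc=0
  for port in 6388 6180 6183; do
    probe_port "$host" "$port" 3 || rc=1
  done
  return "$rc"
}

# Example, using the ClusterIP from the service listing above:
#   probe_metal3_state 172.30.73.28
```

Running the probe before and after a security group change shows which of the three ports the rule actually opened.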

Since the metal3-state service is backed by pods that run in the host network context, I think this traffic is not allowed by default and a rule for it has to be added manually.


Is this expected?

Version-Release number of selected component (if applicable):

How reproducible:

    always

Steps to Reproduce:

    1. 
    2. 
    3.

Actual results:

When the metal3 pods are not running on the same node, the infrastructure-operator is unable to connect to the metal3 pods.

Expected results:

The infrastructure-operator should be able to connect to the metal3 pods.

Additional info:

Assignee: Dmitry Tantsur

Reporter: Daniel Chong

Need Info From: Daniel Chong

Contributors: None

QA Contact: Jad Haj Yahya

Doc Contact: None

Votes: 0

Watchers: 5

Created: 2025/06/23 3:30 PM

Updated: 2025/10/09 2:46 PM

Resolved: 2025/10/01 11:26 AM
