OCPBUGS-57958

[baremetal] TCP port 6388 traffic from the ACM infrastructure-operator does not reach the metal3-state service port


    • Quality / Stability / Reliability
    • Metal Platform 274, Metal Platform 278

      Description of problem:

      The customer reported that a BareMetalHost can only be provisioned when both metal3 pods are running on the same node.
      
      --- oc get pods -A -owide | grep metal3 (cluster-prod-edge-spoke1-dc-sin2) ---
      openshift-machine-api                              metal3-66f78c98bb-4gtqn                                                     4/4     Running                  0             31h     10.68.2.1      control01.itup-002.example.com   <none>           <none>
      openshift-machine-api                              metal3-baremetal-operator-9dc676f77-gn75t                                   1/1     Running                  0             31h     172.16.0.72    control01.itup-002.example.com   <none>           <none>
      -------------------------------------------
      
      Connections from the infrastructure-operator pod to the metal3-state service on port 6388/tcp time out. Because of this, the customer is unable to deploy a new spoke cluster. As a workaround, a custom rule had to be added to the security group attached to the master nodes.
      
      ---ERROR ---
      {"level":"info","ts":1749121148.4973779,"logger":"provisioner.ironic","msg":"error caught while checking endpoint, will retry","host":"example~control02","endpoint":"https://metal3-state.openshift-machine-api.svc.cluster.local:6388/v1/","error":"Get \"https://metal3-state.openshift-machine-api.svc.cluster.local:6388/v1/\": dial tcp 172.30.34.248:6388: i/o timeout"}
      ---
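For triage, the fields that matter in a log line like the one above can be pulled out programmatically. The following is only a minimal sketch: the helper name `summarize_probe_error` is hypothetical, and it assumes the operator emits one JSON object per line with the `host`, `endpoint`, and `error` keys shown in the excerpt.

```python
import json
import re


def summarize_probe_error(line: str) -> dict:
    """Extract the host, endpoint, dialed address, and timeout flag from one JSON log line."""
    record = json.loads(line)
    error = record.get("error", "")
    # The Go dialer reports the resolved address as "dial tcp <ip>:<port>".
    match = re.search(r"dial tcp ([\d.]+):(\d+)", error)
    return {
        "host": record.get("host"),
        "endpoint": record.get("endpoint"),
        "dialed": f"{match.group(1)}:{match.group(2)}" if match else None,
        "timed_out": "i/o timeout" in error,
    }
```

An "i/o timeout" (as opposed to "connection refused") suggests that packets are being dropped on the path rather than rejected by the destination, which is consistent with a missing firewall or security group rule.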
      
      
      --- BEFORE ADDING SG RULE ---
      NAME                                   TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)                      AGE
      metal3-state                           ClusterIP   172.30.73.28     <none>        6388/TCP,6180/TCP,6183/TCP   18h
      
      
      bash-5.1 ~ $ oc rsh infrastructure-operator-6c5698fffb-v8lmg 
      sh-5.1$ curl https://172.30.73.28:6388                       
      curl: (28) Failed to connect to 172.30.73.28 port 6388: Connection timed out
      -----------------------------
      
      --- AFTER ADDING SG RULE ---
      sh-5.1$ curl https://172.30.73.28:6388 -kv
      *   Trying 172.30.73.28:6388...
      * Connected to 172.30.73.28 (172.30.73.28) port 6388 (#0)
      .
      .
      .
      ----------------------------
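The before/after check above can also be run without curl, for example from a pod that has a Python interpreter. This is a minimal sketch, not part of the customer's procedure: the function name `probe_tcp` and the 5-second default timeout are assumptions chosen for illustration.

```python
import socket


def probe_tcp(host: str, port: int, timeout: float = 5.0) -> str:
    """Attempt one TCP connection and report how it ended."""
    try:
        # create_connection performs the TCP handshake only; no TLS is negotiated.
        with socket.create_connection((host, port), timeout=timeout):
            return "connected"
    except socket.timeout:
        # No reply at all: packets are likely dropped somewhere on the path.
        return "timeout"
    except ConnectionRefusedError:
        # The destination answered with a reset: reachable, but nothing is listening.
        return "refused"
    except OSError as exc:
        return f"error: {exc}"
```

Calling `probe_tcp("172.30.73.28", 6388)` with the ClusterIP from the service listing above would be expected to return `"timeout"` before the security group rule is added and `"connected"` afterwards, mirroring the two curl results.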
      
      Because the metal3-state service is backed by pods that run in the host network context, I think this traffic is not allowed by default and the rule has to be added manually.
      
      
      Is this expected? 

      Version-Release number of selected component (if applicable):

          

      How reproducible:

          always

      Steps to Reproduce:

          1. 
          2. 
          3.
          

      Actual results:

      When the metal3 pods are not running on the same node, the infrastructure-operator is unable to connect to the metal3 pods.

      Expected results:

      The infrastructure-operator should be able to connect to the metal3 pods.

      Additional info:

       

              rhn-engineering-dtantsur Dmitry Tantsur
              rhn-support-dchong Daniel Chong
              Daniel Chong
              None
              Jad Haj Yahya Jad Haj Yahya
              None