OCPBUGS-56049

IPI install on Dell/AMD servers is unable to build the cluster because interface names are not being assigned correctly


    • Quality / Stability / Reliability
    • False
    • 3
    • Critical
    • Yes
    • None
    • Metal Platform 271, Metal Platform 277, Metal Platform 278
    • 3
    • In Progress
    • Release Note Not Required
    • None

      Description of problem:

      During install, one or more systems fail to assign the correct interface names. For example, iDRAC shows interface ens2f0 on slot 1 port 1, whereas on the system it shows ens2f0 as slot 1 port 1 via MAC address. The system with the different interface assignment cannot join the cluster and eventually becomes unreachable (it loses its IP address).
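
One way to make the mismatch concrete is to compare the MAC-to-name mapping recorded in iDRAC with what the kernel reports on the affected node. The following is a minimal sketch under stated assumptions: the function names are hypothetical, the expected mapping has to be transcribed by hand from iDRAC, and it assumes each MAC address belongs to a single interface (no bonds).

```python
from pathlib import Path


def read_os_interfaces(sys_class_net="/sys/class/net"):
    """Return {mac: interface_name} as currently reported by the Linux kernel."""
    found = {}
    for dev in Path(sys_class_net).iterdir():
        addr_file = dev / "address"
        if addr_file.is_file():
            found[addr_file.read_text().strip().lower()] = dev.name
    return found


def find_mismatches(expected, observed):
    """Compare two {mac: name} maps.

    Returns a list of (mac, expected_name, observed_name) tuples for every
    MAC whose observed name differs from the expected one. observed_name is
    None when the MAC is not present on the system at all.
    """
    mismatches = []
    for mac, want in sorted(expected.items()):
        got = observed.get(mac.lower())
        if got != want:
            mismatches.append((mac.lower(), want, got))
    return mismatches
```

On a node, `find_mismatches(expected, read_os_interfaces())` would list every port whose name differs from the iDRAC inventory; the MAC addresses in `expected` are whatever the operator copies from iDRAC, not values taken from this ticket.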
          

      Version-Release number of selected component (if applicable):

      OCP 4.18.4, RHEL 4.9
          

      How reproducible:

      The cluster installation fails at least 9 out of 10 times
          

      Steps to Reproduce:

          1. IPI install is initiated
          2. Bootstrap initially comes online
          3. Installation fails
          

      Actual results:

      
          

      Expected results:

      
      Successful OCP install
      
          

      Additional info:

      
      Customer has installed the same configuration on other systems without problems.
      This is the first install with Dell AMD servers.
      The interface assignment for the servers is not predictable/consistent.
      Customer tried initially with bonds but then removed the bonds to simplify the situation and ran into the same problem.
      
      OCP version 4.18.9
      Redfish
      
      3-node BMH cluster
      
      Only one is accessible
      
      oc get no -owide
      
      master-0.dellamd.mavdallab.com   Ready    control-plane,master,worker   20m   v1.31.7   10.69.26.97
      
      
      bootstrap    10.69.26.219/10.69.26.95(api)
      
      master-1.dellamd.mavdallab.com  10.69.26.93 - comes online then drops
      
      master-0.dellamd.mavdallab.com  10.69.26.97 - online after reboot
      
      master-2.dellamd.mavdallab.com  10.69.26.94 - never comes online
      
      May 02 18:03:06 localhost.localdomain baremetal-operator[5558]: {"level":"info","ts":1746208986.411257,"logger":"controllers.BareMetalHost.host_config_data","msg":"PreprovisioningNetworkData networkData key is not set, returning empty data","baremetalhost":{"name":"master-1.dellamd.mavdallab.com","namespace":"openshift-machine-api"},"provisioningState":"provisioned"}
      
      
      time="2025-05-01T11:36:24-05:00" level=info msg="  baremetalhost: master-0.dellamd.mavdallab.com: uninitialized"
      time="2025-05-01T11:36:25-05:00" level=info msg="  baremetalhost: master-1.dellamd.mavdallab.com: uninitialized"
      time="2025-05-01T11:36:25-05:00" level=info msg="  baremetalhost: master-2.dellamd.mavdallab.com: uninitialized"
      time="2025-05-01T11:36:41-05:00" level=info msg="  baremetalhost: master-1.dellamd.mavdallab.com: registering"
      time="2025-05-01T11:36:42-05:00" level=info msg="  baremetalhost: master-0.dellamd.mavdallab.com: registering"
      time="2025-05-01T11:36:42-05:00" level=info msg="  baremetalhost: master-2.dellamd.mavdallab.com: registering"
      time="2025-05-01T11:38:36-05:00" level=info msg="  baremetalhost: master-0.dellamd.mavdallab.com: inspecting"
      time="2025-05-01T11:38:36-05:00" level=info msg="  baremetalhost: master-2.dellamd.mavdallab.com: inspecting"
      time="2025-05-01T11:38:36-05:00" level=info msg="  baremetalhost: master-1.dellamd.mavdallab.com: inspecting"
      time="2025-05-01T11:51:45-05:00" level=info msg="  baremetalhost: master-1.dellamd.mavdallab.com: preparing"
      time="2025-05-01T11:51:45-05:00" level=info msg="  baremetalhost: master-1.dellamd.mavdallab.com: available"
      time="2025-05-01T11:51:45-05:00" level=info msg="  baremetalhost: master-1.dellamd.mavdallab.com: provisioning"
      time="2025-05-01T12:07:58-05:00" level=info msg="  baremetalhost: master-1.dellamd.mavdallab.com: provisioned"
      time="2025-05-01T12:19:24-05:00" level=info msg="  baremetalhost: master-2.dellamd.mavdallab.com: preparing"
      time="2025-05-01T12:19:25-05:00" level=info msg="  baremetalhost: master-2.dellamd.mavdallab.com: available"
      time="2025-05-01T12:19:25-05:00" level=info msg="  baremetalhost: master-2.dellamd.mavdallab.com: provisioning"
      time="2025-05-01T12:20:54-05:00" level=info msg="  baremetalhost: master-0.dellamd.mavdallab.com: preparing"
      time="2025-05-01T12:20:55-05:00" level=info msg="  baremetalhost: master-0.dellamd.mavdallab.com: available"
      time="2025-05-01T12:20:55-05:00" level=info msg="  baremetalhost: master-0.dellamd.mavdallab.com: provisioning"
      time="2025-05-01T12:34:58-05:00" level=error msg="Cluster operator authentication Degraded is True with 
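
To see how long each host sat in each provisioning state, the installer lines above can be turned into a per-host timeline. This is a minimal sketch, assuming the log keeps the `time="..." level=info msg="  baremetalhost: <host>: <state>"` shape shown here; the function name `state_durations` is hypothetical.

```python
import re
from collections import defaultdict
from datetime import datetime

# Matches installer lines of the form quoted in this ticket.
LINE_RE = re.compile(
    r'time="(?P<ts>[^"]+)" level=info '
    r'msg="\s*baremetalhost: (?P<host>\S+): (?P<state>\w+)"'
)


def state_durations(log_text):
    """Return {host: [(state, seconds_in_state), ...]}.

    Only states that were followed by another state are listed, because the
    duration of the last observed state is unknown.
    """
    seen = defaultdict(list)
    for m in LINE_RE.finditer(log_text):
        seen[m["host"]].append((datetime.fromisoformat(m["ts"]), m["state"]))

    result = {}
    for host, events in seen.items():
        events.sort(key=lambda e: e[0])  # stable: keeps log order for equal timestamps
        result[host] = [
            (state, (next_ts - ts).total_seconds())
            for (ts, state), (next_ts, _) in zip(events, events[1:])
        ]
    return result


# Two lines copied from the log above.
sample = (
    'time="2025-05-01T11:38:36-05:00" level=info msg="  baremetalhost: master-1.dellamd.mavdallab.com: inspecting"\n'
    'time="2025-05-01T11:51:45-05:00" level=info msg="  baremetalhost: master-1.dellamd.mavdallab.com: preparing"\n'
)
for host, steps in state_durations(sample).items():
    for state, seconds in steps:
        print(f"{host}: {state} lasted {seconds:.0f}s")
```

Feeding the full installer log to `state_durations` makes it easier to compare the three masters side by side, for example how long each spent inspecting before moving to preparing.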
      
      omc get machines -A
      NAMESPACE               NAME                     PHASE          TYPE   REGION   ZONE   AGE   NODE                   PROVIDERID                       STATE
      openshift-machine-api   dellamd-kzqtm-master-0   Running                               1h    2025-05-02T17:14:28Z   master-0.dellamd.mavdallab.com   baremetalhost:///openshift-machine-api/master-0.dellamd.mavdallab.com/17653a53-d2c3-4ce8-bfcf-a5d06d311e2f
      openshift-machine-api   dellamd-kzqtm-master-1   Provisioned                           1h    2025-05-02T17:14:29Z                                    baremetalhost:///openshift-machine-api/master-1.dellamd.mavdallab.com/ecbd117c-a598-4a91-8305-bee503e8bbc3
      openshift-machine-api   dellamd-kzqtm-master-2   Provisioning                          1h    2025-05-02T17:14:29Z
      
      
      Looking at sosreport-master-0-04131365-2025-05-02-gdqnsui:
      
      May 02 17:52:06 localhost systemd-udevd[2928]: ens5f0: Failed to rename network interface 6 from 'eth2' to 'ens5f0': File exists
      May 02 17:52:06 localhost systemd-udevd[2980]: ens2f1: Failed to rename network interface 9 from 'eth5' to 'ens2f1': File exists
      May 02 17:52:06 localhost systemd-udevd[3025]: ens2f0: Failed to rename network interface 10 from 'eth6' to 'ens2f0': File exists
      May 02 17:52:06 localhost systemd-udevd[3731]: ens5f1: Failed to rename network interface 7 from 'eth3' to 'ens5f1': File exists
      udev is trying to rename some network interfaces to names that are already in use (ens2f* and ens5f*). This seems to match the issue described in https://access.redhat.com/solutions/7112603 and https://issues.redhat.com/browse/RHEL-44630 . As in that Jira issue, the lspci output ("sos_commands/pci/lspci_-tv" in the sosreport) shows the same structure, with two cards hanging from the same IOMMU root complex:
      
      
      -+-[0000:c0]-+-00.0  Advanced Micro Devices, Inc. [AMD] Device 14a4
       |           +-00.2  Advanced Micro Devices, Inc. [AMD] Device 149e
       |           +-00.3  Advanced Micro Devices, Inc. [AMD] Device 14a6
       |           +-01.0  Advanced Micro Devices, Inc. [AMD] Device 149f
       |           +-01.1-[c4]--+-00.0  Intel Corporation Ethernet Controller E810-XXV for SFP
       |           |            \-00.1  Intel Corporation Ethernet Controller E810-XXV for SFP
       |           +-01.2-[c5]--+-00.0  Intel Corporation Ethernet Controller E810-XXV for SFP
       |           |            \-00.1  Intel Corporation Ethernet Controller E810-XXV for SFP
       |           +-02.0  Advanced Micro Devices, Inc. [AMD] Device 149f
       |           +-03.0  Advanced Micro Devices, Inc. [AMD] Device 149f
      ...
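
To check whether the other masters hit the same collisions, the systemd-udevd rename failures quoted from the sosreport can be extracted from any journal dump. This is a minimal sketch, assuming the message format matches the four lines shown in this ticket; `rename_failures` is a hypothetical helper name.

```python
import re

# Matches the systemd-udevd message format quoted in this ticket.
RENAME_RE = re.compile(
    r"Failed to rename network interface (?P<index>\d+) "
    r"from '(?P<kernel>[^']+)' to '(?P<target>[^']+)': (?P<reason>.+)$"
)


def rename_failures(journal_lines):
    """Return one dict per udev rename failure found in the given lines."""
    failures = []
    for line in journal_lines:
        m = RENAME_RE.search(line)
        if m:
            failures.append(
                {
                    "ifindex": int(m["index"]),
                    "kernel_name": m["kernel"],
                    "wanted_name": m["target"],
                    "reason": m["reason"].strip(),
                }
            )
    return failures


# One line copied from the sosreport excerpt above.
sample = [
    "May 02 17:52:06 localhost systemd-udevd[2928]: ens5f0: "
    "Failed to rename network interface 6 from 'eth2' to 'ens5f0': File exists",
]
for f in rename_failures(sample):
    print(f"{f['kernel_name']} (ifindex {f['ifindex']}) could not become "
          f"{f['wanted_name']}: {f['reason']}")
```

Running this over the journal of each node gives a quick list of which kernel names failed to take which predictable names, which can then be compared with the NIC layout in the lspci tree.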
      
      
      
      
      
      
      
      
      
          

              sgoveas@redhat.com Steeve Goveas
              rhn-support-brstone Brian Stone
              None
              None
              Steeve Goveas Steeve Goveas
              None