Bug
Resolution: Unresolved
Major
4.18.z
Description of problem:
During install, one or more systems have a problem assigning the correct interface name. For example, iDRAC shows interface ens2f0 on slot 1 port 1, whereas on the system it shows ens2f0 as slot 1 port 1 via MAC address. The system with the different interface assignment cannot join the cluster and eventually becomes unreachable (loses its IP address).
Version-Release number of selected component (if applicable):
OCP 4.18.4, RHEL 4.9
How reproducible:
The cluster installation fails at least 9 out of 10 times
Steps to Reproduce:
1. ipi install initiated
2. bootstrap initially online
3. installation fails
Actual results:
Expected results:
successful ocp install
Additional info:
Customer has installed the same configuration on other systems without problems
This is the first install with Dell AMD servers.
The interface assignment for the servers is not predictable/consistent.
Customer tried initially with bonds, but then removed the bonds to simplify the situation and ran into the same problem
ocp version 4.18.9
redfish
3 node bmh cluster
Only one is accessible
oc get no -owide
master-0.dellamd.mavdallab.com Ready control-plane,master,worker 20m v1.31.7 10.69.26.97
bootstrap 10.69.26.219/10.69.26.95(api)
master-1.dellamd.mavdallab.com 10.69.26.93 - comes online then drops
master-0.dellamd.mavdallab.com 10.69.26.97 - online after reboot
master-2.dellamd.mavdallab.com 10.69.26.94 - never comes online
May 02 18:03:06 localhost.localdomain baremetal-operator[5558]: {"level":"info","ts":1746208986.411257,"logger":"controllers.BareMetalHost.host_config_data","msg":"PreprovisioningNetworkData networkData key is not set, returning empty data","baremetalhost":{"name":"master-1.dellamd.mavdallab.com","namespace":"openshift-machine-api"},"provisioningState":"provisioned"}
time="2025-05-01T11:36:24-05:00" level=info msg=" baremetalhost: master-0.dellamd.mavdallab.com: uninitialized"
time="2025-05-01T11:36:25-05:00" level=info msg=" baremetalhost: master-1.dellamd.mavdallab.com: uninitialized"
time="2025-05-01T11:36:25-05:00" level=info msg=" baremetalhost: master-2.dellamd.mavdallab.com: uninitialized"
time="2025-05-01T11:36:41-05:00" level=info msg=" baremetalhost: master-1.dellamd.mavdallab.com: registering"
time="2025-05-01T11:36:42-05:00" level=info msg=" baremetalhost: master-0.dellamd.mavdallab.com: registering"
time="2025-05-01T11:36:42-05:00" level=info msg=" baremetalhost: master-2.dellamd.mavdallab.com: registering"
time="2025-05-01T11:38:36-05:00" level=info msg=" baremetalhost: master-0.dellamd.mavdallab.com: inspecting"
time="2025-05-01T11:38:36-05:00" level=info msg=" baremetalhost: master-2.dellamd.mavdallab.com: inspecting"
time="2025-05-01T11:38:36-05:00" level=info msg=" baremetalhost: master-1.dellamd.mavdallab.com: inspecting"
time="2025-05-01T11:51:45-05:00" level=info msg=" baremetalhost: master-1.dellamd.mavdallab.com: preparing"
time="2025-05-01T11:51:45-05:00" level=info msg=" baremetalhost: master-1.dellamd.mavdallab.com: available"
time="2025-05-01T11:51:45-05:00" level=info msg=" baremetalhost: master-1.dellamd.mavdallab.com: provisioning"
time="2025-05-01T12:07:58-05:00" level=info msg=" baremetalhost: master-1.dellamd.mavdallab.com: provisioned"
time="2025-05-01T12:19:24-05:00" level=info msg=" baremetalhost: master-2.dellamd.mavdallab.com: preparing"
time="2025-05-01T12:19:25-05:00" level=info msg=" baremetalhost: master-2.dellamd.mavdallab.com: available"
time="2025-05-01T12:19:25-05:00" level=info msg=" baremetalhost: master-2.dellamd.mavdallab.com: provisioning"
time="2025-05-01T12:20:54-05:00" level=info msg=" baremetalhost: master-0.dellamd.mavdallab.com: preparing"
time="2025-05-01T12:20:55-05:00" level=info msg=" baremetalhost: master-0.dellamd.mavdallab.com: available"
time="2025-05-01T12:20:55-05:00" level=info msg=" baremetalhost: master-0.dellamd.mavdallab.com: provisioning"
time="2025-05-01T12:34:58-05:00" level=error msg="Cluster operator authentication Degraded is True with
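To make the installer log above easier to read, the following is a minimal Python sketch that groups the `baremetalhost` state transitions per host and reports which hosts did not reach `provisioned`. It uses only the standard library; the function names (`host_timelines`, `hosts_not_in_state`) and the sample list are illustrative and not part of any OpenShift tooling. The line format is assumed to match the excerpt above.

```python
import re
from collections import defaultdict
from datetime import datetime

# Assumed line format, taken from the installer log excerpt in this report:
# time="<ISO timestamp>" level=info msg=" baremetalhost: <host>: <state>"
LINE_RE = re.compile(
    r'time="(?P<ts>[^"]+)"\s+level=\w+\s+'
    r'msg="\s*baremetalhost:\s+(?P<host>\S+):\s+(?P<state>\w+)"'
)


def host_timelines(lines):
    """Return {host: [(timestamp, state), ...]} in the order the lines appear."""
    timelines = defaultdict(list)
    for line in lines:
        match = LINE_RE.search(line)
        if match:
            ts = datetime.fromisoformat(match.group("ts"))
            timelines[match.group("host")].append((ts, match.group("state")))
    return dict(timelines)


def hosts_not_in_state(timelines, wanted="provisioned"):
    """Return the hosts whose last logged state differs from `wanted`."""
    return sorted(
        host for host, events in timelines.items() if events[-1][1] != wanted
    )


# A few lines copied from the log above, used as sample input.
SAMPLE = [
    'time="2025-05-01T11:51:45-05:00" level=info msg=" baremetalhost: master-1.dellamd.mavdallab.com: provisioning"',
    'time="2025-05-01T12:07:58-05:00" level=info msg=" baremetalhost: master-1.dellamd.mavdallab.com: provisioned"',
    'time="2025-05-01T12:19:25-05:00" level=info msg=" baremetalhost: master-2.dellamd.mavdallab.com: provisioning"',
    'time="2025-05-01T12:20:55-05:00" level=info msg=" baremetalhost: master-0.dellamd.mavdallab.com: provisioning"',
]

if __name__ == "__main__":
    timelines = host_timelines(SAMPLE)
    for host in hosts_not_in_state(timelines):
        ts, state = timelines[host][-1]
        print(f"{host}: last state {state} at {ts.isoformat()}")
```

Run against the full installer log, this kind of summary shows at a glance that only master-1 logged `provisioned` in the excerpt, while master-0 and master-2 were still in `provisioning` when the excerpt ends.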
omc get machines -A
NAMESPACE NAME PHASE TYPE REGION ZONE AGE NODE PROVIDERID STATE
openshift-machine-api dellamd-kzqtm-master-0 Running 1h 2025-05-02T17:14:28Z master-0.dellamd.mavdallab.com baremetalhost:///openshift-machine-api/master-0.dellamd.mavdallab.com/17653a53-d2c3-4ce8-bfcf-a5d06d311e2f
openshift-machine-api dellamd-kzqtm-master-1 Provisioned 1h 2025-05-02T17:14:29Z baremetalhost:///openshift-machine-api/master-1.dellamd.mavdallab.com/ecbd117c-a598-4a91-8305-bee503e8bbc3
openshift-machine-api dellamd-kzqtm-master-2 Provisioning 1h 2025-05-02T17:14:29Z
Looking at sosreport-master-0-04131365-2025-05-02-gdqnsui
May 02 17:52:06 localhost systemd-udevd[2928]: ens5f0: Failed to rename network interface 6 from 'eth2' to 'ens5f0': File exists
May 02 17:52:06 localhost systemd-udevd[2980]: ens2f1: Failed to rename network interface 9 from 'eth5' to 'ens2f1': File exists
May 02 17:52:06 localhost systemd-udevd[3025]: ens2f0: Failed to rename network interface 10 from 'eth6' to 'ens2f0': File exists
May 02 17:52:06 localhost systemd-udevd[3731]: ens5f1: Failed to rename network interface 7 from 'eth3' to 'ens5f1': File exists
systemd-udevd is trying to rename some network interfaces to names that are already in use (ens2f* and ens5f*). This seems to match the issue description for https://access.redhat.com/solutions/7112603 and https://issues.redhat.com/browse/RHEL-44630 . From the Jira issue, I can see the lspci output ("sos_commands/pci/lspci_-tv" in the sosreport) shows the same structure with two cards hanging from the same IOMMU root complex:
-+-[0000:c0]-+-00.0 Advanced Micro Devices, Inc. [AMD] Device 14a4
| +-00.2 Advanced Micro Devices, Inc. [AMD] Device 149e
| +-00.3 Advanced Micro Devices, Inc. [AMD] Device 14a6
| +-01.0 Advanced Micro Devices, Inc. [AMD] Device 149f
| +-01.1-[c4]--+-00.0 Intel Corporation Ethernet Controller E810-XXV for SFP
| | \-00.1 Intel Corporation Ethernet Controller E810-XXV for SFP
| +-01.2-[c5]--+-00.0 Intel Corporation Ethernet Controller E810-XXV for SFP
| | \-00.1 Intel Corporation Ethernet Controller E810-XXV for SFP
| +-02.0 Advanced Micro Devices, Inc. [AMD] Device 149f
| +-03.0 Advanced Micro Devices, Inc. [AMD] Device 149f
...
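To check other nodes for the same symptom, the systemd-udevd rename failures quoted earlier from the sosreport can be pulled out of a journal dump with a short script. This is a minimal Python sketch using only the standard library; the function name `rename_failures` and the dictionary keys are hypothetical, and the message format is assumed to match the four lines quoted in this report.

```python
import re

# Assumed message format, taken from the systemd-udevd lines in this report:
# "... Failed to rename network interface <ifindex> from '<old>' to '<new>': <reason>"
RENAME_RE = re.compile(
    r"Failed to rename network interface (?P<ifindex>\d+) "
    r"from '(?P<old>[^']+)' to '(?P<new>[^']+)': (?P<reason>.+)$"
)


def rename_failures(lines):
    """Return one dict per udev rename failure found in `lines`."""
    failures = []
    for line in lines:
        match = RENAME_RE.search(line.rstrip())
        if match:
            failures.append(
                {
                    "ifindex": int(match.group("ifindex")),
                    "kernel_name": match.group("old"),
                    "wanted_name": match.group("new"),
                    "reason": match.group("reason"),
                }
            )
    return failures


# The four journal lines quoted from the master-0 sosreport.
SAMPLE = [
    "May 02 17:52:06 localhost systemd-udevd[2928]: ens5f0: Failed to rename network interface 6 from 'eth2' to 'ens5f0': File exists",
    "May 02 17:52:06 localhost systemd-udevd[2980]: ens2f1: Failed to rename network interface 9 from 'eth5' to 'ens2f1': File exists",
    "May 02 17:52:06 localhost systemd-udevd[3025]: ens2f0: Failed to rename network interface 10 from 'eth6' to 'ens2f0': File exists",
    "May 02 17:52:06 localhost systemd-udevd[3731]: ens5f1: Failed to rename network interface 7 from 'eth3' to 'ens5f1': File exists",
]

if __name__ == "__main__":
    for failure in rename_failures(SAMPLE):
        print(
            f"ifindex {failure['ifindex']}: {failure['kernel_name']} -> "
            f"{failure['wanted_name']} ({failure['reason']})"
        )
```

Every failure with the reason `File exists` means the target name was already held by another interface, which is consistent with two cards being assigned the same predictable names.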
is blocked by
OCPBUGS-62739 Need new CoreOS boot image with nmstate-2.2.50 (POST)
is duplicated by
OCPBUGS-56001 attemping to install openshift ipi on 3 bmh Dell servers with AMD ethernet cards. One of the systems will not get the correct name for the interface and the install will fail (Closed)