Type: Bug
Resolution: Done-Errata
Priority: Major
Fix Version/s: None
Affects Version/s: 4.17.z
Component/s: Networking / SR-IOV
Labels:
- SRIOV

Activity Type:
Quality / Stability / Reliability
Blocked:
False
Blocked Reason:

None
Story Points:
3.5
Severity:
Important
Regression:
Yes

Target Backport Versions:
None
Target Version:

4.17.z
Release Blocker:
None
Sprint:
NHE Sprint 268
sprint_count:
1

SFDC Cases Counter:
SFDC Cases Open:
SFDC Cases Links:

PX Impact Score:

Release Note Status:
Done
Release Note Type:
Bug Fix
Release Note Text:

* Previously, the Single Root I/O Virtualization (SR-IOV) network config daemon unbound network drivers from the physical function (PF) interface instead of unbinding the drivers from the virtual function (VF) interface when SR-IOV was configured with an InfiniBand (IB) type. This unbinding workflow removed the IB interface from the node, and this situation made the IB interface non-functional. With this release, a fix to the SR-IOV network config daemon ensures that the IB interface remains functional by correctly unbinding the VF network interface. Additionally, the SR-IOV Network Operator targets the network drivers of a VF interface instead of the PF interface when configuring SR-IOV with an IB type. (link:https://issues.redhat.com/browse/OCPBUGS-53254[*OCPBUGS-53254*])
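
The PF/VF distinction in the release note can be illustrated with a minimal sketch. It is not the operator's code: it only assumes the standard Linux sysfs layout, in which a PF directory under `/sys/bus/pci/devices/` exposes `virtfnN` symlinks to its VFs and each bound device has a `driver/unbind` file. The function names and the `sysfs_root` parameter are hypothetical, and the sketch only computes paths; it never writes to them.

```python
import os
from pathlib import Path
from typing import List, Tuple


def vf_addresses(pf_addr: str, sysfs_root: str = "/sys") -> List[str]:
    """Return the PCI addresses of the VFs that belong to the PF at pf_addr."""
    pf_dir = Path(sysfs_root) / "bus" / "pci" / "devices" / pf_addr
    # Each VF appears as a symlink named virtfn0, virtfn1, ... in the PF directory.
    links = [p for p in pf_dir.glob("virtfn*") if p.is_symlink()]
    # Sort by numeric VF index so virtfn10 comes after virtfn2.
    links.sort(key=lambda p: int(p.name[len("virtfn"):]))
    # The last component of the link target is the VF's PCI address.
    return [os.path.basename(os.readlink(p)) for p in links]


def unbind_targets(pf_addr: str, sysfs_root: str = "/sys") -> List[Tuple[str, Path]]:
    """Return (vf_address, unbind_path) pairs; the PF itself is never included."""
    base = Path(sysfs_root) / "bus" / "pci" / "devices"
    return [
        (vf, base / vf / "driver" / "unbind")
        for vf in vf_addresses(pf_addr, sysfs_root)
    ]
```

Keeping the PF address out of the returned list is the point of the sketch: unbinding the PF's driver would take the IB interface off the node, which is the failure described above.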

Escape Reason:
None
Escape Impact:
None
Corrective Measures:
None
SDLC stage when should've been found:
None

Description of problem:

The customer is observing node reboot churn when configuring IB IFs/VFs in 4.17.z: the IF/VF gets configured (without the VF GUID), but the customer observed about 24 hours of reboot churn before everything stabilised. The customer believes this is a regression between 4.16.z and 4.17.z due to the lack of a commit associated with the following PRs:

The behaviour being seen is reported as follows by the customer:
Created By: Julien Dethurens  (3/5/2025 9:51 AM [MT])
...
After more than 24 hours of reboots, sk0be-005x reached a state where the virtual functions are configured and the interface was not lost. I uploaded a sos report -k crio.all=on -k crio.logs=on  -k podman.all=on -k podman.logs=on -a of the node in this state as sosreport-sk0be-005x-04071984-2025-03-05-gdfdixi.tar.xz and a oc adm must-gather --image-stream=openshift/must-gather --image=registry.redhat.io/openshift4/ose-sriov-operator-must-gather as 04071984_must-gather.local.6300288682719350670.tar.gz. The status.allocatable for the node has openshift.io/sriov_ib: "64" and the test pod was scheduled on it.
The pod is stuck in ContainerCreating state, with this error appearing in its events and in the logs of the multus-z8cf9 pod (for sk0be-005x) in the openshift-multus namespace: error adding container to network "proj-k8s-ssc-hpci-sriov-ib-network": infiniBand SRI-OV CNI failed to configure VF "VF ibs1f0v18 GUID is not valid".
Indeed, the VFs all have guid: "00:00:00:00:00:00:00:00" in the SriovNetworkNodeState for sk0be-005x and the node_guid file in sysfs shows 0000:0000:0000:0000:
sh-5.1# ethtool -i ibs1f0v18
driver: mlx5_core[ib_ipoib]
version: 25.01-0.6.0
firmware-version: 20.42.1000 (MT_0000000453)
expansion-rom-version: 
bus-info: 0000:03:02.4
supports-statistics: yes
supports-test: yes
supports-eeprom-access: no
supports-register-dump: no
supports-priv-flags: yes
sh-5.1# cat /sys/bus/pci/devices/0000\:03\:02.4/infiniband/mlx5_20/node_guid
0000:0000:0000:0000

At the time of that comment, another node was in the process of the reboot churn.

I see two things that are concerning: the node reboot churn and the subsequent failure of a pod to use a VF due to an invalid GUID. The customer believes this to be a regression, but I don't have the knowledge to validate that, so I am asking here.
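
The two GUID representations quoted above (`00:00:00:00:00:00:00:00` in the SriovNetworkNodeState and `0000:0000:0000:0000` in the sysfs `node_guid` file) describe the same all-zero value. A minimal sketch of a check that treats both formats uniformly follows. The helper names are hypothetical, and the rule used here (exactly 16 hex digits, not all zero) is an assumption for illustration; the InfiniBand SR-IOV CNI may apply additional checks.

```python
import re

# A GUID is 64 bits, i.e. 16 hex digits once the separators are removed.
_HEX16 = re.compile(r"^[0-9a-f]{16}$")


def normalize_guid(raw: str) -> str:
    """Strip whitespace and ':' separators so both quoted formats compare equal."""
    return raw.strip().replace(":", "").lower()


def is_usable_guid(raw: str) -> bool:
    """Return True only for a well-formed GUID that is not all zeros."""
    digits = normalize_guid(raw)
    return bool(_HEX16.match(digits)) and int(digits, 16) != 0
```

With this check, both values reported for `ibs1f0v18` are rejected, which matches the "GUID is not valid" error seen in the pod events.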

Version-Release number of selected component (if applicable):

4.17.z

How reproducible:

unsure

Steps to Reproduce:

    1.
    2.
    3.

Actual results:

The node goes into reboot churn, and the resulting IB VFs lack a valid GUID and so are not attachable.

Expected results:

VFs are configured with a valid GUID and the node does not spend a day in reboot churn.

Additional info:

depends on

OCPBUGS-53281 SRIOV PF got unbind instead of VF in case of IB link type

Closed

links to

openshift/sriov-network-operator#1065: [release-4.17] OCPBUGS-53254: SRIOV PF got unbind instead of VF in case of IB link type

RHBA-2025:3564 OpenShift Container Platform 4.17.z extras update

Assignee: William Zhao

Reporter: milti leonard

Need Info From: None

Contributors: None

QA Contact: Zhiqiang Fang

Doc Contact: None

Votes: 1

Watchers: 8

Created: 2025/03/18 3:53 PM

Updated: 2025/09/13 4:58 AM

Resolved: 2025/04/09 4:13 AM
