-
Bug
-
Resolution: Done-Errata
-
Major
-
None
-
4.17.z
-
Quality / Stability / Reliability
-
False
-
-
3.5
-
Important
-
Yes
-
None
-
None
-
NHE Sprint 268
-
1
-
Done
-
Bug Fix
-
-
None
-
None
-
None
-
None
Description of problem:
The customer is observing node reboot churn when configuring InfiniBand (IB) interfaces/VFs in 4.17.z: the IF/VF gets configured (sans the VF GUID), but the customer observed about 24 hours of reboot churn before everything stabilised. The customer believes this is a regression between 4.16.z and 4.17.z, due to a missing commit associated with the following commit and PR:
- https://github.com/openshift/sriov-network-operator/commit/dc299c464d838a4d73dffe1978cc9edac0bc64fb
- https://github.com/k8snetworkplumbingwg/sriov-network-operator/pull/797
The behaviour being seen is reported as follows by the customer:

Created By: Julien Dethurens (3/5/2025 9:51 AM [MT])
...
After more than 24 hours of reboots, sk0be-005x reached a state where the virtual functions are configured and the interface was not lost. I uploaded a sos report -k crio.all=on -k crio.logs=on -k podman.all=on -k podman.logs=on -a of the node in this state as sosreport-sk0be-005x-04071984-2025-03-05-gdfdixi.tar.xz and a oc adm must-gather --image-stream=openshift/must-gather --image=registry.redhat.io/openshift4/ose-sriov-operator-must-gather as 04071984_must-gather.local.6300288682719350670.tar.gz.

The status.allocatable for the node has openshift.io/sriov_ib: "64" and the test pod was scheduled on it. The pod is stuck in ContainerCreating state, with this error appearing in its events and in the logs of the multus-z8cf9 pod (for sk0be-005x) in the openshift-multus namespace:

    error adding container to network "proj-k8s-ssc-hpci-sriov-ib-network": infiniBand SRI-OV CNI failed to configure VF "VF ibs1f0v18 GUID is not valid".

Indeed, the VFs all have guid: "00:00:00:00:00:00:00:00" in the SriovNetworkNodeState for sk0be-005x and the node_guid file in sysfs shows 0000:0000:0000:0000:

    sh-5.1# ethtool -i ibs1f0v18
    driver: mlx5_core[ib_ipoib]
    version: 25.01-0.6.0
    firmware-version: 20.42.1000 (MT_0000000453)
    expansion-rom-version:
    bus-info: 0000:03:02.4
    supports-statistics: yes
    supports-test: yes
    supports-eeprom-access: no
    supports-register-dump: no
    supports-priv-flags: yes
    sh-5.1# cat /sys/bus/pci/devices/0000\:03\:02.4/infiniband/mlx5_20/node_guid
    0000:0000:0000:0000

At the time of that comment, another node was in the process of the reboot churn. I see two things that are concerning: the node reboot churn and the subsequent failure of a pod to use a VF due to an invalid GUID. The customer believes this to be a regression, but I don't have the knowledge to validate that, so I am asking here.
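The sysfs check quoted above can be repeated across every IB device on a node with a small script. This is a minimal sketch, not part of the operator or the CNI: the function names are hypothetical, it assumes the devices are exposed under /sys/class/infiniband/<device>/node_guid, and it treats only an all-zero GUID as invalid, since that is the symptom in this report.

```python
import glob
import os
import re


def is_zero_guid(text):
    """Return True if the GUID string contains only zero hex digits.

    Accepts both layouts seen in this report, e.g.
    "0000:0000:0000:0000" and "00:00:00:00:00:00:00:00".
    """
    digits = re.sub(r"[^0-9a-fA-F]", "", text)
    return len(digits) > 0 and set(digits) == {"0"}


def find_zero_guid_devices(pattern="/sys/class/infiniband/*/node_guid"):
    """Return the names of IB devices whose node_guid file is all zeros.

    The default glob pattern is an assumption about where the node
    exposes its IB devices; pass another pattern if the layout differs.
    """
    bad = []
    for path in sorted(glob.glob(pattern)):
        try:
            with open(path) as fh:
                guid = fh.read().strip()
        except OSError:
            # Skip devices whose attribute cannot be read.
            continue
        if is_zero_guid(guid):
            # The device name is the directory holding node_guid.
            bad.append(os.path.basename(os.path.dirname(path)))
    return bad


if __name__ == "__main__":
    for dev in find_zero_guid_devices():
        print(f"{dev}: node_guid is all zeros")
```

Run from a debug shell on the node, this lists each device (such as mlx5_20 in the output above) whose GUID was never assigned.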
Version-Release number of selected component (if applicable):
4.17.z
How reproducible:
unsure
Steps to Reproduce:
1. 2. 3.
Actual results:
The node goes into reboot churn, and the resulting IB VFs lack a valid GUID, so they cannot be attached to pods.
Expected results:
VFs are configured with a valid GUID, and the node does not spend a day in reboot churn.
Additional info:
- depends on: OCPBUGS-53281 SRIOV PF got unbind instead of VF in case of IB link type (Closed)
- links to: RHBA-2025:3564 OpenShift Container Platform 4.17.z extras update