  1. OpenShift Bugs
  2. OCPBUGS-53254

Customer observes node reboot churn of up to 24h when configuring IB IFs/VFs using the SR-IOV Operator; the VF GUIDs do not appear to be set properly; customer believes it is a regression


    • Bug
    • Resolution: Done-Errata
    • Major
    • None
    • 4.17.z
    • Networking / SR-IOV
    • Quality / Stability / Reliability
    • False
    • 3.5
    • Important
    • Yes
    • None
    • None
    • NHE Sprint 268
    • 1
    • Done
    • Bug Fix
      * Previously, the Single Root I/O Virtualization (SR-IOV) network config daemon unbound network drivers from the physical function (PF) interface instead of the virtual function (VF) interface when SR-IOV was configured with an InfiniBand (IB) type. This unbinding workflow removed the IB interface from the node, which made the IB interface non-functional. With this release, a fix to the SR-IOV network config daemon ensures that the daemon unbinds the VF network interface, so the IB interface remains functional. Additionally, the SR-IOV Network Operator targets the network drivers of a VF interface instead of the PF interface when configuring SR-IOV with an IB type. (link:https://issues.redhat.com/browse/OCPBUGS-53254[*OCPBUGS-53254*])

      Description of problem:

       The customer is observing node reboot churn when configuring IB IFs/VFs in 4.17.z: the IF/VF gets configured (without the VF GUID), but the customer observed about 24h of reboot churn before everything stabilised. The customer believes this is a regression between 4.16.z and 4.17.z due to the lack of a commit associated with the following PRs: 

       

       

       

      the behaviour being seen is reported as follows by the cu:
      Created By: Julien Dethurens  (3/5/2025 9:51 AM [MT])
      ...
      After more than 24 hours of reboots, sk0be-005x reached a state where the virtual functions are configured and the interface was not lost. I uploaded a sos report -k crio.all=on -k crio.logs=on  -k podman.all=on -k podman.logs=on -a of the node in this state as sosreport-sk0be-005x-04071984-2025-03-05-gdfdixi.tar.xz and a oc adm must-gather --image-stream=openshift/must-gather --image=registry.redhat.io/openshift4/ose-sriov-operator-must-gather as 04071984_must-gather.local.6300288682719350670.tar.gz. The status.allocatable for the node has openshift.io/sriov_ib: "64" and the test pod was scheduled on it.
      The pod is stuck in ContainerCreating state, with this error appearing in its events and in the logs of the multus-z8cf9 pod (for sk0be-005x) in the openshift-multus namespace: error adding container to network "proj-k8s-ssc-hpci-sriov-ib-network": infiniBand SRI-OV CNI failed to configure VF "VF ibs1f0v18 GUID is not valid".
      Indeed, the VFs all have guid: "00:00:00:00:00:00:00:00" in the SriovNetworkNodeState for sk0be-005x and the node_guid file in sysfs shows 0000:0000:0000:0000:
      sh-5.1# ethtool -i ibs1f0v18
      driver: mlx5_core[ib_ipoib]
      version: 25.01-0.6.0
      firmware-version: 20.42.1000 (MT_0000000453)
      expansion-rom-version: 
      bus-info: 0000:03:02.4
      supports-statistics: yes
      supports-test: yes
      supports-eeprom-access: no
      supports-register-dump: no
      supports-priv-flags: yes
      sh-5.1# cat /sys/bus/pci/devices/0000\:03\:02.4/infiniband/mlx5_20/node_guid
      0000:0000:0000:0000
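
      The manual check above covers a single VF. As a rough sketch only, the same check could be repeated for every InfiniBand device on the node by walking the sysfs layout shown in the `cat` command. The function names and the `pci_root` parameter are hypothetical helpers introduced for illustration; they are not part of the SR-IOV operator or any Red Hat tooling.

      ```python
      from pathlib import Path


      def is_zero_guid(guid: str) -> bool:
          """Return True when the GUID has no non-zero hex digit (or is empty)."""
          digits = guid.strip().replace(":", "")
          return digits == "" or set(digits) == {"0"}


      def find_zero_guid_devices(pci_root: str = "/sys/bus/pci/devices") -> list[tuple[str, str]]:
          """List (PCI address, IB device) pairs whose node_guid file is all zeros.

          Assumes the layout seen in the report:
          <pci_root>/<pci address>/infiniband/<ib device>/node_guid
          """
          hits = []
          for guid_file in sorted(Path(pci_root).glob("*/infiniband/*/node_guid")):
              if is_zero_guid(guid_file.read_text()):
                  ib_dev = guid_file.parent.name        # e.g. mlx5_20
                  pci_addr = guid_file.parents[2].name  # e.g. 0000:03:02.4
                  hits.append((pci_addr, ib_dev))
          return hits


      if __name__ == "__main__":
          # Run on the node itself (for example from a debug shell).
          for pci_addr, ib_dev in find_zero_guid_devices():
              print(f"{pci_addr} {ib_dev}: node_guid is all zeros")
      ```

      Note that this lists every IB device with an all-zero node_guid under the given root; it does not distinguish VFs from PFs, so the output still has to be matched against the VF PCI addresses.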
      
      At the time of that comment, another node was going through the same reboot churn.
      
      I see two concerning things: the node reboot churn AND the subsequent failure of a pod to use a VF due to an invalid GUID. The customer believes this to be a regression, but I don't have the knowledge to validate that, so I am asking here.
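
      To gauge how many VFs are affected by the invalid-GUID symptom, the SriovNetworkNodeState for a node could be exported as JSON and filtered. The following is a minimal sketch, not operator code: it assumes the VFs are listed under status.interfaces[].Vfs[] with name, pciAddress and guid fields, which should be verified against the actual resource. The function name and the sample document are hypothetical; the sample values are taken from the customer comment above.

      ```python
      def vfs_with_zero_guid(node_state: dict) -> list[str]:
          """Return the names of VFs whose reported GUID is all zeros.

          ASSUMPTION: VFs live under status.interfaces[].Vfs[] and carry
          "name", "pciAddress" and "guid" keys; adjust if the schema differs.
          """
          names = []
          for iface in node_state.get("status", {}).get("interfaces", []):
              for vf in iface.get("Vfs", []):
                  guid = vf.get("guid", "")
                  # Skip VFs that report no GUID at all; flag all-zero ones.
                  if guid and set(guid.replace(":", "")) == {"0"}:
                      names.append(vf.get("name") or vf.get("pciAddress", "<unknown>"))
          return names


      # Hypothetical, heavily trimmed sample modelled on the values in the report.
      sample_state = {
          "status": {
              "interfaces": [
                  {
                      "Vfs": [
                          {
                              "name": "ibs1f0v18",
                              "pciAddress": "0000:03:02.4",
                              "guid": "00:00:00:00:00:00:00:00",
                          }
                      ]
                  }
              ]
          }
      }

      for vf_name in vfs_with_zero_guid(sample_state):
          print(f"{vf_name}: GUID is all zeros")
      ```

      In practice the dictionary would come from json.load() on the output of an `oc get sriovnetworknodestate ... -o json` call for the node in question rather than from an inline sample.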

       

       

      Version-Release number of selected component (if applicable):

      4.17.z

      How reproducible:

      unsure

      Steps to Reproduce:

          1.
          2.
          3.
          

      Actual results:

      The node goes into reboot churn, and the resulting IB VFs lack a valid GUID and so cannot be attached to pods.

      Expected results:

      VFs are configured with a valid GUID and the node does not spend a day in reboot churn.

      Additional info:

          

              wizhao@redhat.com William Zhao
              rhn-support-mleonard milti leonard
              None
              None
              Zhiqiang Fang Zhiqiang Fang
              None