  1. OpenShift Bugs
  2. OCPBUGS-30875

VF GUID Configuration in Nvidia Infiniband not being detected by SRIOV

Details

    • Bug
    • Resolution: Won't Do
    • Major
    • None
    • 4.12
    • Networking / SR-IOV
    • None
    • Moderate
    • No
    • CNF Network Sprint 251
    • 1
    • False

    Description

      Description of problem:

      After configuring this use case (Nvidia Infiniband on a Red Hat OpenShift connected or air-gapped cluster: https://www.redhat.com/en/blog/nvidia-infiniband-red-hat-openshift-connected-or-air-gapped-cluster), that is, with MOFED activated using NVIDIA, there is an issue with the VF GUIDs: when the MOFED driver loads, it sets every VF to 0, and the SRIOV operator does not get the correct GUID configuration.
      The SRIOV operator should detect the network cards, configure the VFs, and assign GUIDs on the cards. There is a possible race condition while the MOFED pods are coming up: the pods work correctly only after the SRIOV operator pods (sriov-network-config-daemon and sriov-device-plugin) are restarted.
      It looks like the SRIOV pods check the PCI cards only at startup, and not during runtime.
      Currently the system works once the pods have been restarted.
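
      One way to confirm the symptom on a node is to look for VFs that still report an all-zero GUID. The following is a minimal sketch, assuming the per-VF lines of `ip link show <pf>` carry `NODE_GUID` and `PORT_GUID` fields as iproute2 prints them for InfiniBand VFs. The sample text, the interface name `ib0`, and the function name `zero_guid_vfs` are illustrative assumptions and do not come from the affected cluster.

```python
import re

# One VF entry per line in `ip link show <pf>` output: "vf <index> ...".
VF_LINE = re.compile(r"^\s*vf\s+(?P<index>\d+)\b(?P<rest>.*)$")
# GUID fields as printed for InfiniBand VFs (assumed format).
GUID_FIELD = re.compile(r"(?:NODE_GUID|PORT_GUID)\s+(?P<guid>[0-9a-fA-F:]+)")


def zero_guid_vfs(ip_link_output):
    """Return the sorted indices of VFs whose node or port GUID is all zeros."""
    bad = set()
    for line in ip_link_output.splitlines():
        match = VF_LINE.match(line)
        if not match:
            continue  # not a VF line (PF header, link line, ...)
        for field in GUID_FIELD.finditer(match.group("rest")):
            digits = field.group("guid").replace(":", "")
            if set(digits) == {"0"}:
                bad.add(int(match.group("index")))
    return sorted(bad)


# Hypothetical sample, shaped like `ip link show ib0` output.
SAMPLE = """\
4: ib0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 4092 state UP
    vf 0     NODE_GUID 00:00:00:00:00:00:00:00, PORT_GUID 00:00:00:00:00:00:00:00, link-state disable
    vf 1     NODE_GUID 00:11:22:33:44:55:66:01, PORT_GUID 00:11:22:33:44:55:66:01, link-state auto
"""

if __name__ == "__main__":
    print(zero_guid_vfs(SAMPLE))
```

      Running the check before and after restarting the SRIOV pods would show whether the GUIDs are applied only after the restart, which is what the suspected race condition predicts.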
      

      Version-Release number of selected component (if applicable):

          4.12

      How reproducible:

      We cannot collect an SRIOV must-gather (high-security infrastructure), nor do we have labs in support to reproduce the issue on our end.

      Actual results:

      As a temporary workaround, the team has to restart the SRIOV pods manually while the MOFED pod is loading; the VFs then get the correct GUID.
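
      The manual restart can be scripted. This is a sketch only: the namespace `openshift-sriov-network-operator` and the `app=...` label selectors are assumptions about a default operator installation and should be verified on the cluster (for example with `oc get pods --show-labels`) before use. The helper names are hypothetical.

```python
import subprocess

# Assumptions: default operator namespace and pod labels; verify on the cluster.
NAMESPACE = "openshift-sriov-network-operator"
POD_LABELS = ("app=sriov-network-config-daemon", "app=sriov-device-plugin")


def restart_commands(namespace=NAMESPACE, labels=POD_LABELS):
    """Build one `oc delete pod` command per label selector."""
    return [["oc", "delete", "pod", "-n", namespace, "-l", label] for label in labels]


def restart_sriov_pods(dry_run=True):
    """Print the commands (dry run) or execute them so the pods are recreated."""
    for cmd in restart_commands():
        if dry_run:
            print(" ".join(cmd))
        else:
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    restart_sriov_pods(dry_run=True)
```

      Deleting the pods relies on their controllers recreating them, at which point they re-scan the PCI devices. The dry-run default only prints the commands so they can be reviewed first.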

      Expected results:

          The VFs get the correct GUID without restarting the pods.

      Additional info:

          OCP 4.12, DGX servers with 8xA100-80G each

      Attachments

        Activity

          People

            sscheink@redhat.com Sebastian Scheinkman
            rhn-support-dahernan David Hernandez Fernandez
            Zhanqi Zhao Zhanqi Zhao
