OpenShift Virtualization / CNV-21343

[2128107] sriov-manage command fails to enable SRIOV Virtual functions on the Ampere GPU Cards

    • Important
    • None

      Description of problem:

      After configuring the NVIDIA GPU Operator, the following pods
      were not found for Ampere-based GPU cards:

      nvidia-sandbox-device-plugin-daemonset-5rsv9
      nvidia-sandbox-device-plugin-daemonset-q225z
      nvidia-sandbox-validator-996wt
      nvidia-sandbox-validator-shwj9
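
A quick way to confirm which of these pods are absent is to filter the pod list by name prefix. The sketch below is only an illustration: `missing_pods` is a hypothetical helper, the two prefixes are taken from the pod names listed above, and the `nvidia-gpu-operator` namespace is assumed from the workaround commands in this report.

```shell
#!/bin/sh
# Hypothetical helper: reads pod names on stdin (one per line) and reports
# which of the expected name prefixes from this report have no matching pod.
missing_pods() {
    pods=$(cat)
    rc=0
    for prefix in nvidia-sandbox-device-plugin-daemonset nvidia-sandbox-validator; do
        if ! printf '%s\n' "$pods" | grep -q "$prefix"; then
            echo "missing: $prefix"
            rc=1
        fi
    done
    return $rc
}

# Example against a live cluster (namespace assumed from the workaround below):
#   oc -n nvidia-gpu-operator get pods -o name | missing_pods
```

The function returns non-zero when at least one prefix has no match, so it can be used as a simple check after applying the workaround.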

      This is probably because the "lspci" command was not found in the container "openshift-driver-toolkit-ctr" of pod "nvidia-vgpu-manager-daemonset-411.86.202208031059-0", so sriov-manage could not enable the virtual functions:

      [kbidarka@localhost nvidia-gpu-operator]$ oc logs -c openshift-driver-toolkit-ctr -f nvidia-vgpu-manager-daemonset-411.86.202208031059-0-8wfxh | grep -A 5 "sriov-manage"
      + /usr/lib/nvidia/sriov-manage -e ALL
      /usr/lib/nvidia/sriov-manage: line 259: lspci: command not found
      + return 0
      Done, now waiting for signal
      + echo 'Done, now waiting for signal'
      + trap 'echo '\''Caught signal'\''; _shutdown; trap - EXIT; exit' HUP INT QUIT PIPE TERM
      + true
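
The missing tool can be confirmed directly before attempting a fix. The following is a minimal sketch: `check_tool` is a hypothetical helper that tests for a command on PATH in whatever shell it runs in, and the commented `oc exec` line shows how the same check might be run inside the container (pod name and namespace are taken from this report).

```shell
#!/bin/sh
# Hypothetical helper: report whether a command is available on PATH
# in the current shell. Returns non-zero if it is missing.
check_tool() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "$1: found"
    else
        echo "$1: missing"
        return 1
    fi
}

# The same test inside the driver toolkit container would look like:
#   oc -n nvidia-gpu-operator exec \
#     pod/nvidia-vgpu-manager-daemonset-411.86.202208031059-0-8wfxh \
#     -c openshift-driver-toolkit-ctr -- \
#     /bin/sh -c 'command -v lspci || echo "lspci: missing"'
```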

      Version-Release number of selected component (if applicable):

      How reproducible:
      Installing Nvidia GPU Operator on Ampere GPU Architecture.

      Steps to Reproduce:
      1.
      2.
      3.

      Actual results:
      + /usr/lib/nvidia/sriov-manage -e ALL
      /usr/lib/nvidia/sriov-manage: line 259: lspci: command not found

      Expected results:
      + /usr/lib/nvidia/sriov-manage -e ALL

      The above command should complete without errors.

      Additional info:

      Workaround:
      1) Install the pciutils package in the "openshift-driver-toolkit-ctr" container and re-run sriov-manage.
      2) Then label the node with "nvidia.com/vgpu.config=<MDEV-TYPE>".

      1) oc -n nvidia-gpu-operator exec pod/nvidia-vgpu-manager-daemonset-411.86.202208031059-0-8wfxh -it -c openshift-driver-toolkit-ctr -- /bin/sh -euxc 'dnf install -y pciutils; /usr/lib/nvidia/sriov-manage -e ALL'
         oc -n nvidia-gpu-operator exec pod/nvidia-vgpu-manager-daemonset-411.86.202208031059-0-vk7pw -it -c openshift-driver-toolkit-ctr -- /bin/sh -euxc 'dnf install -y pciutils; /usr/lib/nvidia/sriov-manage -e ALL'

      2) oc label node node32.redhat.com --overwrite nvidia.com/vgpu.config=A2-2Q
         oc label node node33.redhat.com --overwrite nvidia.com/vgpu.config=A2-2Q
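
When more than a couple of nodes are affected, the two workaround steps can be wrapped in a small script. This is a sketch under assumptions, not a supported procedure: `run`, `enable_vfs`, and `label_nodes` are hypothetical helper names, pod and node names must be supplied by the caller, and `-it` is dropped because the script runs non-interactively. It defaults to a dry run that only prints the commands.

```shell
#!/bin/sh
# Namespace and container name are taken from the commands in this report.
NAMESPACE="nvidia-gpu-operator"
CONTAINER="openshift-driver-toolkit-ctr"
# DRY_RUN=1 (default) prints commands instead of executing them.
DRY_RUN="${DRY_RUN:-1}"

# Print the command in dry-run mode, otherwise execute it.
# Note: the dry-run output does not preserve shell quoting.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Step 1: install pciutils and re-run sriov-manage in each given pod.
enable_vfs() {
    for pod in "$@"; do
        run oc -n "$NAMESPACE" exec "pod/$pod" -c "$CONTAINER" -- \
            /bin/sh -euxc 'dnf install -y pciutils; /usr/lib/nvidia/sriov-manage -e ALL'
    done
}

# Step 2: label each given node with the chosen vGPU type (first argument).
label_nodes() {
    vgpu_type="$1"
    shift
    for node in "$@"; do
        run oc label node "$node" --overwrite "nvidia.com/vgpu.config=$vgpu_type"
    done
}
```

For example, after sourcing the script, `label_nodes A2-2Q node32.redhat.com node33.redhat.com` prints the two label commands from step 2; setting `DRY_RUN=0` executes them instead. Because the package is installed into a running container, the change is presumably lost when the pod restarts, so the steps may need to be repeated.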

              sgott@redhat.com Stuart Gott
              kbidarka@redhat.com Kedar Bidarkar