OpenShift Bugs: OCPBUGS-9216

OCP 4.10.8: NVIDIA GPU Operator v1.10.0 fails to deploy successfully on IPI Google Cloud Platform a2-highgpu-1g A100 GPU instance


    • Type: Bug
    • Resolution: Unresolved
    • Priority: Major
    • Fix Version/s: None
    • Affects Version/s: 4.10
    • Component/s: ISV Operators
    • Severity: Important
    • Release Blocker: Rejected
    • Architecture: x86_64
    • Docs Type: If docs needed, set a value

      Description of problem:
      GPU Operator v1.10.0 fails to deploy successfully on a GCP IPI cluster with an A100 GPU. This is a regression from GPU Operator v1.9.1, which deployed successfully along with its ClusterPolicy on an OCP 4.10.z IPI cluster in Google Cloud with an A100 GPU.
      The failure is observed on a worker node of instance type "a2-highgpu-1g" with an NVIDIA A100 GPU:
      GPU 0: NVIDIA A100-SXM4-40GB

      dmesg errors on the GPU-enabled worker node:
      [ 3861.196890] NVRM: GPU 0000:00:04.0: RmInitAdapter failed! (0x63:0x55:2344)
      [ 3861.206749] NVRM: GPU 0000:00:04.0: rm_init_adapter failed, device minor number 0
      ..
      [ 399.355629] NVRM: loading NVIDIA UNIX x86_64 Kernel Module 510.47.03 Mon Jan 24 22:58:54 UTC 2022

      [pod/nvidia-operator-validator-c69mx/driver-validation] running command chroot with args [/run/nvidia/driver nvidia-smi]
      [pod/nvidia-operator-validator-c69mx/driver-validation] No devices were found
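
      For reference, the check that the driver-validation init container runs can be reproduced by hand against the driver daemonset pod. The pod name below is taken from the listing under "Actual results"; the container name nvidia-driver-ctr is assumed from the operator's defaults and may differ:

      # Run nvidia-smi inside the driver container of the failing daemonset pod
      oc exec -n nvidia-gpu-operator \
        nvidia-driver-daemonset-410.84.202203290245-0-8dv52 \
        -c nvidia-driver-ctr -- nvidia-smi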

      Version-Release number of selected component (if applicable):

      • OCP 4.10.8
      • Nodes running Kubernetes v1.23.5+1f952b3

      How reproducible:
      Every time

      Steps to Reproduce:
      1. Create an OCP IPI cluster in Google Cloud with 3 master and 3 worker nodes
      2. Create a new machineset with:
      providerSpec:
        value:
          apiVersion: machine.openshift.io/v1beta1
          canIPForward: false
          credentialsSecret:
            name: gcp-cloud-credentials
          deletionProtection: false
          ...
          kind: GCPMachineProviderSpec
          machineType: a2-highgpu-1g
          onHostMaintenance: Terminate
      3. oc create -f new-a2-highgpu-1g-machineset.yaml
      4. Deploy NFD from OperatorHub and create an instance of the NFD operand
      5. Deploy the GPU operator from OperatorHub, then create a ClusterPolicy taking all defaults (a CLI sketch of steps 4-5 follows below)
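
      For reference, steps 4 and 5 can also be done from the CLI. The CSV names below are illustrative placeholders, not values from this cluster; the console pre-fills the default specs from the operators' alm-examples annotations, which is what these commands pull.

      # Step 4 (sketch): create the default NFD operand from the NFD operator CSV's alm-examples
      # (CSV name is illustrative; check "oc get csv -n openshift-nfd")
      oc get csv -n openshift-nfd nfd.4.10.0 \
        -o jsonpath='{.metadata.annotations.alm-examples}' \
        | jq '.[] | select(.kind == "NodeFeatureDiscovery")' > nfd-instance.json
      oc apply -n openshift-nfd -f nfd-instance.json

      # Step 5 (sketch): create the ClusterPolicy with the operator defaults the same way
      # (CSV name is illustrative; check "oc get csv -n nvidia-gpu-operator")
      oc get csv -n nvidia-gpu-operator gpu-operator-certified.v1.10.0 \
        -o jsonpath='{.metadata.annotations.alm-examples}' \
        | jq '.[] | select(.kind == "ClusterPolicy")' > clusterpolicy.json
      oc apply -f clusterpolicy.json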

      Actual results:

      1. oc get pods -n nvidia-gpu-operator
        NAME READY STATUS RESTARTS AGE
        gpu-feature-discovery-659sg 0/1 Init:0/1 0 105m
        gpu-operator-64b6f555df-59bkj 1/1 Running 13 (5m57s ago) 106m
        nvidia-container-toolkit-daemonset-9f5nc 0/1 Init:0/1 0 105m
        nvidia-dcgm-exporter-nrws9 0/1 Init:0/2 0 105m
        nvidia-dcgm-p9b4b 0/1 Init:0/1 0 105m
        nvidia-device-plugin-daemonset-h5dfz 0/1 Init:0/1 0 105m
        nvidia-driver-daemonset-410.84.202203290245-0-8dv52 2/2 Running 0 105m
        nvidia-node-status-exporter-2brwd 1/1 Running 0 105m
        nvidia-operator-validator-c69mx 0/1 Init:0/4 0 105m
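
      The pods stuck in Init are blocked behind driver validation; the validator's init container log can be pulled directly (the driver-validation container name is the one shown in the description above):

      # See why the validator never gets past Init:0/4
      oc describe pod -n nvidia-gpu-operator nvidia-operator-validator-c69mx
      oc logs -n nvidia-gpu-operator nvidia-operator-validator-c69mx -c driver-validation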

      Expected results:
      For comparison, this is the same listing when deploying GPU Operator v1.9.1:

      1. oc get pods -n nvidia-gpu-operator
        NAME READY STATUS RESTARTS AGE
        gpu-feature-discovery-p4s2k 1/1 Running 0 4m7s
        gpu-operator-576df6dfc9-9xf9w 1/1 Running 0 7m24s
        nvidia-container-toolkit-daemonset-q2ztk 1/1 Running 0 4m7s
        nvidia-cuda-validator-m9f99 0/1 Completed 0 70s
        nvidia-dcgm-b6mkx 1/1 Running 0 4m7s
        nvidia-dcgm-exporter-ng56n 1/1 Running 0 4m7s
        nvidia-device-plugin-daemonset-tbkbg 1/1 Running 0 4m7s
        nvidia-device-plugin-validator-fhfns 0/1 Completed 0 54s
        nvidia-driver-daemonset-410.84.202203290245-0-jqrtk 2/2 Running 0 4m50s
        nvidia-mig-manager-6k48n 1/1 Running 0 23s
        nvidia-node-status-exporter-bghx2 1/1 Running 0 4m51s
        nvidia-operator-validator-gx925 1/1 Running 0 4m7s

      Additional info:
      Pod logs and dmesg output from the nvidia-driver-daemonset pod are attached.
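
      A sketch of how equivalent logs can be collected; the driver container name and the node name placeholder are assumptions:

      # Driver daemonset pod logs (container name nvidia-driver-ctr assumed from the operator defaults)
      oc logs -n nvidia-gpu-operator \
        nvidia-driver-daemonset-410.84.202203290245-0-8dv52 \
        -c nvidia-driver-ctr > nvidia-driver-daemonset.log
      # dmesg from the A100 worker; replace <a100-worker> with the node name from "oc get nodes"
      oc debug node/<a100-worker> -- chroot /host dmesg > dmesg.log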

            Assignee: Fabien Dupont (fdupont@redhat.com)
            Reporter: Walid Abouhamad (walid@redhat.com)