Bug
Resolution: Unresolved
Priority: Major
Version: 4.10
Severity: Important
Rejected
Architecture: x86_64
Description of problem:
GPU Operator v1.10.0 fails to deploy successfully on a GCP IPI cluster with an A100 GPU. This is a regression from GPU Operator v1.9.1, which deployed successfully along with its ClusterPolicy on an OCP 4.10.z IPI cluster on a Google Cloud instance with an A100 GPU.
The failure occurs on a worker node with instance type "a2-highgpu-1g" and an NVIDIA A100 GPU:
GPU 0: NVIDIA A100-SXM4-40GB
dmesg errors on the GPU-enabled worker node:
[ 3861.196890] NVRM: GPU 0000:00:04.0: RmInitAdapter failed! (0x63:0x55:2344)
[ 3861.206749] NVRM: GPU 0000:00:04.0: rm_init_adapter failed, device minor number 0
..
[ 399.355629] NVRM: loading NVIDIA UNIX x86_64 Kernel Module 510.47.03 Mon Jan 24 22:58:54 UTC 2022
[pod/nvidia-operator-validator-c69mx/driver-validation] running command chroot with args [/run/nvidia/driver nvidia-smi]
[pod/nvidia-operator-validator-c69mx/driver-validation] No devices were found
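For reference, the validator output and dmesg messages above were gathered roughly as follows (the pod name is the one from this cluster; the node name is a placeholder):
  # driver-validation init container logs from the operator validator pod
  oc logs -n nvidia-gpu-operator nvidia-operator-validator-c69mx -c driver-validation
  # dmesg from the GPU-enabled worker node
  oc debug node/<gpu-worker-node> -- chroot /host dmesg | grep -i nvrm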
Version-Release number of selected component (if applicable):
- OCP 4.10.8
- nodes on kubernetes v1.23.5+1f952b3
How reproducible:
Every time
Steps to Reproduce:
1. Create an OCP IPI cluster in Google Cloud with 3 master and 3 worker nodes
2. Create a new machineset with:
   providerSpec:
     value:
       apiVersion: machine.openshift.io/v1beta1
       canIPForward: false
       credentialsSecret:
         name: gcp-cloud-credentials
       deletionProtection: false
       ...
       kind: GCPMachineProviderSpec
       machineType: a2-highgpu-1g
       onHostMaintenance: Terminate
3. oc create -f new-a2-highgpu-1g-machineset.yaml
4. Deploy NFD from OperatorHub and create an instance of the NFD operand
5. Deploy the GPU Operator from OperatorHub, then create a ClusterPolicy accepting all defaults (a CLI sketch for this step follows the list)
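A minimal CLI sketch of step 5, assuming the certified operator's usual CSV layout (the CSV name filter and the assumption that the ClusterPolicy sample is the first alm-examples entry are mine, not taken from this cluster):
  # Locate the GPU Operator CSV (name varies with the release).
  CSV=$(oc get csv -n nvidia-gpu-operator -o name | grep gpu-operator-certified)
  # Pull the sample ClusterPolicy out of the CSV's alm-examples annotation
  # and create it unmodified, i.e. with all defaults.
  oc get -n nvidia-gpu-operator "$CSV" -o jsonpath='{.metadata.annotations.alm-examples}' \
    | jq '.[0]' > clusterpolicy.json
  oc apply -f clusterpolicy.json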
Actual results:
- oc get pods -n nvidia-gpu-operator
NAME READY STATUS RESTARTS AGE
gpu-feature-discovery-659sg 0/1 Init:0/1 0 105m
gpu-operator-64b6f555df-59bkj 1/1 Running 13 (5m57s ago) 106m
nvidia-container-toolkit-daemonset-9f5nc 0/1 Init:0/1 0 105m
nvidia-dcgm-exporter-nrws9 0/1 Init:0/2 0 105m
nvidia-dcgm-p9b4b 0/1 Init:0/1 0 105m
nvidia-device-plugin-daemonset-h5dfz 0/1 Init:0/1 0 105m
nvidia-driver-daemonset-410.84.202203290245-0-8dv52 2/2 Running 0 105m
nvidia-node-status-exporter-2brwd 1/1 Running 0 105m
nvidia-operator-validator-c69mx 0/1 Init:0/4 0 105m
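The daemonset pods stay stuck in their init containers. The stuck state can be inspected with commands along these lines (pod names are the ones from this cluster):
  # Why the validator pod never gets past its init containers
  oc describe pod -n nvidia-gpu-operator nvidia-operator-validator-c69mx
  # Full logs from the driver daemonset pod (both containers)
  oc logs -n nvidia-gpu-operator nvidia-driver-daemonset-410.84.202203290245-0-8dv52 --all-containers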
Expected results:
For reference, this is the pod list when deploying GPU Operator v1.9.1:
- oc get pods -n nvidia-gpu-operator
NAME READY STATUS RESTARTS AGE
gpu-feature-discovery-p4s2k 1/1 Running 0 4m7s
gpu-operator-576df6dfc9-9xf9w 1/1 Running 0 7m24s
nvidia-container-toolkit-daemonset-q2ztk 1/1 Running 0 4m7s
nvidia-cuda-validator-m9f99 0/1 Completed 0 70s
nvidia-dcgm-b6mkx 1/1 Running 0 4m7s
nvidia-dcgm-exporter-ng56n 1/1 Running 0 4m7s
nvidia-device-plugin-daemonset-tbkbg 1/1 Running 0 4m7s
nvidia-device-plugin-validator-fhfns 0/1 Completed 0 54s
nvidia-driver-daemonset-410.84.202203290245-0-jqrtk 2/2 Running 0 4m50s
nvidia-mig-manager-6k48n 1/1 Running 0 23s
nvidia-node-status-exporter-bghx2 1/1 Running 0 4m51s
nvidia-operator-validator-gx925 1/1 Running 0 4m7s
Additional info:
Pod logs and dmesg output from the nvidia-driver-daemonset pod are attached.