Type: Bug
Resolution: Done
Priority: Critical
Fix Version: rhelai-1.3
Status: Approved
Executing the 'ilab system info' command prints the following warning:
/opt/app-root/lib64/python3.11/site-packages/torch/cuda/__init__.py:128: UserWarning: CUDA initialization: Unexpected error from cudaGetDeviceCount(). Did you run some cuda functions before calling NumCudaDevices() that might have already set an error? Error 803: system has unsupported display driver / cuda driver combination (Triggered internally at /mount/work-dir/torch-2.4.1/torch-2.4.1/c10/cuda/CUDAFunctions.cpp:108.)
return torch._C._cuda_getDeviceCount() > 0
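For reference, the 803 in the message above is a CUDA runtime error code. A small lookup sketch (the names are assumed from my reading of the CUDA toolkit's driver_types.h; worth double-checking against the headers):

```python
# A few CUDA runtime error codes relevant here (assumed from the CUDA
# toolkit's driver_types.h; only a small subset is shown).
CUDA_ERRORS = {
    100: "cudaErrorNoDevice",
    802: "cudaErrorSystemNotReady",
    803: "cudaErrorSystemDriverMismatch",
}

# 803 is the code reported by cudaGetDeviceCount() in the warning above:
# the user-space CUDA libraries do not match the loaded display driver.
print(CUDA_ERRORS[803])
```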
Full command's output:
[cloud-user@nvd-srv-30 ~]$ ilab system info
/opt/app-root/lib64/python3.11/site-packages/torch/cuda/__init__.py:128: UserWarning: CUDA initialization: Unexpected error from cudaGetDeviceCount(). Did you run some cuda functions before calling NumCudaDevices() that might have already set an error? Error 803: system has unsupported display driver / cuda driver combination (Triggered internally at /mount/work-dir/torch-2.4.1/torch-2.4.1/c10/cuda/CUDAFunctions.cpp:108.)
return torch._C._cuda_getDeviceCount() > 0
Platform:
sys.version: 3.11.7 (main, Oct 9 2024, 00:00:00) [GCC 11.4.1 20231218 (Red Hat 11.4.1-3)]
sys.platform: linux
os.name: posix
platform.release: 5.14.0-427.42.1.el9_4.x86_64
platform.machine: x86_64
platform.node: nvd-srv-30.nvidia.eng.rdu2.dc.redhat.com
platform.python_version: 3.11.7
os-release.ID: rhel
os-release.VERSION_ID: 9.4
os-release.PRETTY_NAME: Red Hat Enterprise Linux 9.4 (Plow)
memory.total: 250.89 GB
memory.available: 239.19 GB
memory.used: 9.92 GB
InstructLab:
instructlab.version: 0.21.0
instructlab-dolomite.version: 0.2.0
instructlab-eval.version: 0.4.1
instructlab-quantize.version: 0.1.0
instructlab-schema.version: 0.4.1
instructlab-sdg.version: 0.6.0
instructlab-training.version: 0.6.1
Torch:
torch.version: 2.4.1
torch.backends.cpu.capability: AVX512
torch.version.cuda: 12.4
torch.version.hip: None
torch.cuda.available: False
torch.backends.cuda.is_built: True
torch.backends.mps.is_built: False
torch.backends.mps.is_available: False
llama_cpp_python:
llama_cpp_python.version: 0.2.79
llama_cpp_python.supports_gpu_offload: True
I haven't tried the other ilab commands yet, but I expect any of them that touch CUDA to behave similarly.
[cloud-user@nvd-srv-30 ~]$ podman images --format json | jq .[0]
{
"Id": "d220151f422a897870ab6e5c2ef7f1f41269a9cbf360412ee92274e89f1736f3",
"ParentId": "",
"RepoTags": null,
"RepoDigests": [
"quay.io/redhat-user-workloads/rhel-ai-tenant/instructlab-nvidia-1-3/instructlab-nvidia-1-3@sha256:280f035175cf708f18daba9a1475643b4e7784c87a7341be44e9d2e63e970616",
"registry.stage.redhat.io/rhelai1/instructlab-nvidia-rhel9@sha256:280f035175cf708f18daba9a1475643b4e7784c87a7341be44e9d2e63e970616"
],
"Size": 17739606765,
"SharedSize": 0,
"VirtualSize": 17739606765,
"Labels": {
"WHEEL_RELEASE": "v1.3.1031+rhelai-cuda-ubi9",
"architecture": "x86_64",
"build-date": "2024-11-20T10:44:15",
"com.redhat.component": "ubi9-container",
"com.redhat.license_terms": "https://www.redhat.com/en/about/red-hat-end-user-license-agreements#UBI",
"description": "The Universal Base Image is designed and engineered to be the base layer for all of your containerized applications, middleware and utilities. This base image is freely redistributable, but Red Hat only supports Red Hat technologies through subscriptions for Red Hat products. This image is maintained by Red Hat and updated regularly.",
"distribution-scope": "public",
"io.buildah.version": "1.38.0-dev",
"io.k8s.description": "The Universal Base Image is designed and engineered to be the base layer for all of your containerized applications, middleware and utilities. This base image is freely redistributable, but Red Hat only supports Red Hat technologies through subscriptions for Red Hat products. This image is maintained by Red Hat and updated regularly.",
"io.k8s.display-name": "Red Hat Universal Base Image 9",
"io.openshift.expose-services": "",
"io.openshift.tags": "base rhel9",
"maintainer": "Red Hat, Inc.",
"name": "ubi9",
"org.opencontainers.image.vendor": "Red Hat, Inc.",
"release": "1214.1729773476",
"summary": "Provides the latest release of Red Hat Universal Base Image 9.",
"url": "https://access.redhat.com/containers/#/registry.access.redhat.com/ubi9/images/9.4-1214.1729773476",
"vcs-ref": "cc63f67f3cace7a256ee1a8fefafbb37a9d07f05",
"vcs-type": "git",
"vendor": "Red Hat, Inc.",
"version": "9.4"
},
"Containers": 0,
"Names": [
"registry.stage.redhat.io/rhelai1/instructlab-nvidia-rhel9:1.3-1730980057",
"quay.io/redhat-user-workloads/rhel-ai-tenant/instructlab-nvidia-1-3/instructlab-nvidia-1-3:cc63f67f3cace7a256ee1a8fefafbb37a9d07f05"
],
"Digest": "sha256:280f035175cf708f18daba9a1475643b4e7784c87a7341be44e9d2e63e970616",
"History": [
"quay.io/redhat-user-workloads/rhel-ai-tenant/instructlab-nvidia-1-3/instructlab-nvidia-1-3:cc63f67f3cace7a256ee1a8fefafbb37a9d07f05",
"registry.stage.redhat.io/rhelai1/instructlab-nvidia-rhel9:1.3-1730980057"
],
"Created": 1732100899,
"CreatedAt": "2024-11-20T11:08:19Z"
}
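As a sanity check on the image itself, both RepoDigests entries carry the same sha256 as the Digest field, so the Konflux and stage-registry references point at the same build. A quick sketch of that check (digests copied verbatim from the podman output above):

```python
# Verify that every RepoDigests entry resolves to the image's Digest,
# i.e. both registry references name the same build.
inspect = {
    "RepoDigests": [
        "quay.io/redhat-user-workloads/rhel-ai-tenant/instructlab-nvidia-1-3/instructlab-nvidia-1-3@sha256:280f035175cf708f18daba9a1475643b4e7784c87a7341be44e9d2e63e970616",
        "registry.stage.redhat.io/rhelai1/instructlab-nvidia-rhel9@sha256:280f035175cf708f18daba9a1475643b4e7784c87a7341be44e9d2e63e970616",
    ],
    "Digest": "sha256:280f035175cf708f18daba9a1475643b4e7784c87a7341be44e9d2e63e970616",
}

# Take the part after '@' in each reference and compare against Digest.
digests = {ref.split("@", 1)[1] for ref in inspect["RepoDigests"]}
assert digests == {inspect["Digest"]}
print("all references resolve to the same digest")
```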
[cloud-user@nvd-srv-30 ~]$ sudo bootc status
apiVersion: org.containers.bootc/v1alpha1
kind: BootcHost
metadata:
  name: host
spec:
  image:
    image: registry.redhat.io/rhelai1/bootc-nvidia-rhel9:1.1
    transport: registry
  bootOrder: default
status:
  staged: null
  booted:
    image:
      image:
        image: registry.redhat.io/rhelai1/bootc-nvidia-rhel9:1.1
        transport: registry
      version: 9.20241104.0
      timestamp: null
      imageDigest: sha256:34cb007e44dc3a3e5c98a07d02b557ff37b1bdacc39a1d0aad25db339b6624ee
    cachedUpdate: null
    incompatible: false
    pinned: false
    store: ostreeContainer
    ostree:
      checksum: 4ac3ef66fe4957df890d742ae1bdf2a54affc62bce074b689d4b8c3389d4e2f4
      deploySerial: 0
  rollback: null
  rollbackQueued: false
  type: bootcHost
The host is a Dell R760xa.
[cloud-user@nvd-srv-30 ~]$ nvidia-smi
Fri Nov 22 20:21:25 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.127.05 Driver Version: 550.127.05 CUDA Version: 12.4 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA L40S On | 00000000:4A:00.0 Off | 0 |
| N/A 27C P8 30W / 350W | 1MiB / 46068MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
| 1 NVIDIA L40S On | 00000000:61:00.0 Off | 0 |
| N/A 26C P8 30W / 350W | 1MiB / 46068MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
| 2 NVIDIA L40S On | 00000000:CA:00.0 Off | 0 |
| N/A 25C P8 31W / 350W | 1MiB / 46068MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
| 3 NVIDIA L40S On | 00000000:E1:00.0 Off | 0 |
| N/A 23C P8 19W / 350W | 1MiB / 46068MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| No running processes found |
+-----------------------------------------------------------------------------------------+
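For what it's worth, the host driver above (550.127.05) advertises CUDA 12.4 and should satisfy the minimum Linux driver for a CUDA 12.4 runtime (550.54.14, per my reading of NVIDIA's compatibility table; treat that floor as an assumption). That makes the host driver itself an unlikely culprit and points at the CUDA user-space stack injected into the container. A sketch of the comparison:

```python
# Compare the host driver (from the nvidia-smi output above) against the
# assumed minimum Linux driver for CUDA 12.4 (550.54.14 -- verify against
# NVIDIA's CUDA compatibility documentation before relying on it).
def parse_version(v: str) -> tuple[int, ...]:
    """Split a dotted driver version into comparable integer fields."""
    return tuple(int(part) for part in v.split("."))

host_driver = "550.127.05"        # reported by nvidia-smi
min_for_cuda_12_4 = "550.54.14"   # assumed CUDA 12.4 minimum driver

ok = parse_version(host_driver) >= parse_version(min_for_cuda_12_4)
print("host driver new enough for CUDA 12.4:", ok)
```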