Project: Red Hat Enterprise Linux AI
Issue: RHELAI-2355

ilab prints 'Unexpected error from cudaGetDeviceCount()' error msg


    • Type: Bug
    • Resolution: Done
    • Priority: Critical
    • rhelai-1.3
    • rhelai-1.3
    • Approved

      Executing the 'ilab system info' command prints the following warning:

      /opt/app-root/lib64/python3.11/site-packages/torch/cuda/__init__.py:128: UserWarning: CUDA initialization: Unexpected error from cudaGetDeviceCount(). Did you run some cuda functions before calling NumCudaDevices() that might have already set an error? Error 803: system has unsupported display driver / cuda driver combination (Triggered internally at /mount/work-dir/torch-2.4.1/torch-2.4.1/c10/cuda/CUDAFunctions.cpp:108.)
        return torch._C._cuda_getDeviceCount() > 0 

      Full output of the command:

      [cloud-user@nvd-srv-30 ~]$ ilab system info
      /opt/app-root/lib64/python3.11/site-packages/torch/cuda/__init__.py:128: UserWarning: CUDA initialization: Unexpected error from cudaGetDeviceCount(). Did you run some cuda functions before calling NumCudaDevices() that might have already set an error? Error 803: system has unsupported display driver / cuda driver combination (Triggered internally at /mount/work-dir/torch-2.4.1/torch-2.4.1/c10/cuda/CUDAFunctions.cpp:108.)
        return torch._C._cuda_getDeviceCount() > 0
      Platform:
        sys.version: 3.11.7 (main, Oct  9 2024, 00:00:00) [GCC 11.4.1 20231218 (Red Hat 11.4.1-3)]
        sys.platform: linux
        os.name: posix
        platform.release: 5.14.0-427.42.1.el9_4.x86_64
        platform.machine: x86_64
        platform.node: nvd-srv-30.nvidia.eng.rdu2.dc.redhat.com
        platform.python_version: 3.11.7
        os-release.ID: rhel
        os-release.VERSION_ID: 9.4
        os-release.PRETTY_NAME: Red Hat Enterprise Linux 9.4 (Plow)
        memory.total: 250.89 GB
        memory.available: 239.19 GB
        memory.used: 9.92 GB
      InstructLab:
        instructlab.version: 0.21.0
        instructlab-dolomite.version: 0.2.0
        instructlab-eval.version: 0.4.1
        instructlab-quantize.version: 0.1.0
        instructlab-schema.version: 0.4.1
        instructlab-sdg.version: 0.6.0
        instructlab-training.version: 0.6.1
      Torch:
        torch.version: 2.4.1
        torch.backends.cpu.capability: AVX512
        torch.version.cuda: 12.4
        torch.version.hip: None
        torch.cuda.available: False
        torch.backends.cuda.is_built: True
        torch.backends.mps.is_built: False
        torch.backends.mps.is_available: False
      llama_cpp_python:
        llama_cpp_python.version: 0.2.79
        llama_cpp_python.supports_gpu_offload: True 

      I haven't tried the other ilab commands yet, but I expect some of them to behave similarly.

      [cloud-user@nvd-srv-30 ~]$ podman images --format json | jq .[0]
      {
        "Id": "d220151f422a897870ab6e5c2ef7f1f41269a9cbf360412ee92274e89f1736f3",
        "ParentId": "",
        "RepoTags": null,
        "RepoDigests": [
          "quay.io/redhat-user-workloads/rhel-ai-tenant/instructlab-nvidia-1-3/instructlab-nvidia-1-3@sha256:280f035175cf708f18daba9a1475643b4e7784c87a7341be44e9d2e63e970616",
          "registry.stage.redhat.io/rhelai1/instructlab-nvidia-rhel9@sha256:280f035175cf708f18daba9a1475643b4e7784c87a7341be44e9d2e63e970616"
        ],
        "Size": 17739606765,
        "SharedSize": 0,
        "VirtualSize": 17739606765,
        "Labels": {
          "WHEEL_RELEASE": "v1.3.1031+rhelai-cuda-ubi9",
          "architecture": "x86_64",
          "build-date": "2024-11-20T10:44:15",
          "com.redhat.component": "ubi9-container",
          "com.redhat.license_terms": "https://www.redhat.com/en/about/red-hat-end-user-license-agreements#UBI",
          "description": "The Universal Base Image is designed and engineered to be the base layer for all of your containerized applications, middleware and utilities. This base image is freely redistributable, but Red Hat only supports Red Hat technologies through subscriptions for Red Hat products. This image is maintained by Red Hat and updated regularly.",
          "distribution-scope": "public",
          "io.buildah.version": "1.38.0-dev",
          "io.k8s.description": "The Universal Base Image is designed and engineered to be the base layer for all of your containerized applications, middleware and utilities. This base image is freely redistributable, but Red Hat only supports Red Hat technologies through subscriptions for Red Hat products. This image is maintained by Red Hat and updated regularly.",
          "io.k8s.display-name": "Red Hat Universal Base Image 9",
          "io.openshift.expose-services": "",
          "io.openshift.tags": "base rhel9",
          "maintainer": "Red Hat, Inc.",
          "name": "ubi9",
          "org.opencontainers.image.vendor": "Red Hat, Inc.",
          "release": "1214.1729773476",
          "summary": "Provides the latest release of Red Hat Universal Base Image 9.",
          "url": "https://access.redhat.com/containers/#/registry.access.redhat.com/ubi9/images/9.4-1214.1729773476",
          "vcs-ref": "cc63f67f3cace7a256ee1a8fefafbb37a9d07f05",
          "vcs-type": "git",
          "vendor": "Red Hat, Inc.",
          "version": "9.4"
        },
        "Containers": 0,
        "Names": [
          "registry.stage.redhat.io/rhelai1/instructlab-nvidia-rhel9:1.3-1730980057",
          "quay.io/redhat-user-workloads/rhel-ai-tenant/instructlab-nvidia-1-3/instructlab-nvidia-1-3:cc63f67f3cace7a256ee1a8fefafbb37a9d07f05"
        ],
        "Digest": "sha256:280f035175cf708f18daba9a1475643b4e7784c87a7341be44e9d2e63e970616",
        "History": [
          "quay.io/redhat-user-workloads/rhel-ai-tenant/instructlab-nvidia-1-3/instructlab-nvidia-1-3:cc63f67f3cace7a256ee1a8fefafbb37a9d07f05",
          "registry.stage.redhat.io/rhelai1/instructlab-nvidia-rhel9:1.3-1730980057"
        ],
        "Created": 1732100899,
        "CreatedAt": "2024-11-20T11:08:19Z"
      }
      [cloud-user@nvd-srv-30 ~]$ sudo bootc status
      apiVersion: org.containers.bootc/v1alpha1
      kind: BootcHost
      metadata:
        name: host
      spec:
        image:
          image: registry.redhat.io/rhelai1/bootc-nvidia-rhel9:1.1
          transport: registry
        bootOrder: default
      status:
        staged: null
        booted:
          image:
            image:
              image: registry.redhat.io/rhelai1/bootc-nvidia-rhel9:1.1
              transport: registry
            version: 9.20241104.0
            timestamp: null
            imageDigest: sha256:34cb007e44dc3a3e5c98a07d02b557ff37b1bdacc39a1d0aad25db339b6624ee
          cachedUpdate: null
          incompatible: false
          pinned: false
          store: ostreeContainer
          ostree:
            checksum: 4ac3ef66fe4957df890d742ae1bdf2a54affc62bce074b689d4b8c3389d4e2f4
            deploySerial: 0
        rollback: null
        rollbackQueued: false
        type: bootcHost 

      The host is a Dell R760xa.

      [cloud-user@nvd-srv-30 ~]$ nvidia-smi
      Fri Nov 22 20:21:25 2024
      +-----------------------------------------------------------------------------------------+
      | NVIDIA-SMI 550.127.05             Driver Version: 550.127.05     CUDA Version: 12.4     |
      |-----------------------------------------+------------------------+----------------------+
      | GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
      | Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
      |                                         |                        |               MIG M. |
      |=========================================+========================+======================|
      |   0  NVIDIA L40S                    On  |   00000000:4A:00.0 Off |                    0 |
      | N/A   27C    P8             30W /  350W |       1MiB /  46068MiB |      0%      Default |
      |                                         |                        |                  N/A |
      +-----------------------------------------+------------------------+----------------------+
      |   1  NVIDIA L40S                    On  |   00000000:61:00.0 Off |                    0 |
      | N/A   26C    P8             30W /  350W |       1MiB /  46068MiB |      0%      Default |
      |                                         |                        |                  N/A |
      +-----------------------------------------+------------------------+----------------------+
      |   2  NVIDIA L40S                    On  |   00000000:CA:00.0 Off |                    0 |
      | N/A   25C    P8             31W /  350W |       1MiB /  46068MiB |      0%      Default |
      |                                         |                        |                  N/A |
      +-----------------------------------------+------------------------+----------------------+
      |   3  NVIDIA L40S                    On  |   00000000:E1:00.0 Off |                    0 |
      | N/A   23C    P8             19W /  350W |       1MiB /  46068MiB |      0%      Default |
      |                                         |                        |                  N/A |
      +-----------------------------------------+------------------------+----------------------+
      +-----------------------------------------------------------------------------------------+
      | Processes:                                                                              |
      |  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
      |        ID   ID                                                               Usage      |
      |=========================================================================================|
      |  No running processes found                                                             |
      +-----------------------------------------------------------------------------------------+ 
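
      CUDA error 803 ("system has unsupported display driver / cuda driver combination") typically means the CUDA user-space stack in use requires a different host driver version than the one installed, even when nvidia-smi itself works. A minimal sketch of that version comparison (the helper names and the minimum-driver value are illustrative assumptions, not part of ilab or torch):

```python
# Hypothetical helper: compare NVIDIA driver versions as dotted-integer tuples.
# The minimum-driver value used below is an assumed example; consult NVIDIA's
# release notes for the actual requirement of a given CUDA toolkit.

def parse_version(version: str) -> tuple[int, ...]:
    """Turn a dotted version string like '550.127.05' into (550, 127, 5)."""
    return tuple(int(part) for part in version.split("."))

def driver_supports_cuda(driver_version: str, min_driver_version: str) -> bool:
    """True if the installed driver meets the minimum the CUDA stack needs."""
    return parse_version(driver_version) >= parse_version(min_driver_version)

# Driver version taken from the nvidia-smi output above; the minimum is assumed.
print(driver_supports_cuda("550.127.05", "550.54.14"))
```

      If a check like this fails, the usual fix is aligning the driver shipped in the bootc image with the CUDA runtime inside the instructlab container; here the host driver looks current, which points at a mismatch with the container's CUDA libraries rather than an outdated host driver.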

        Fabien Dupont (fdupont@redhat.com)
        Ariel Opincaru (aopincar)
        Votes: 0
        Watchers: 12