Project: Red Hat Enterprise Linux AI
Issue: RHELAI-2355

ilab prints 'Unexpected error from cudaGetDeviceCount()' error msg


    • Type: Bug
    • Resolution: Done
    • Priority: Critical
    • rhelai-1.3
    • rhelai-1.3
    • Approved

      Executing the 'ilab system info' command prints the following warning:

      /opt/app-root/lib64/python3.11/site-packages/torch/cuda/__init__.py:128: UserWarning: CUDA initialization: Unexpected error from cudaGetDeviceCount(). Did you run some cuda functions before calling NumCudaDevices() that might have already set an error? Error 803: system has unsupported display driver / cuda driver combination (Triggered internally at /mount/work-dir/torch-2.4.1/torch-2.4.1/c10/cuda/CUDAFunctions.cpp:108.)
        return torch._C._cuda_getDeviceCount() > 0 

      Full output of the command:

      [cloud-user@nvd-srv-30 ~]$ ilab system info
      /opt/app-root/lib64/python3.11/site-packages/torch/cuda/__init__.py:128: UserWarning: CUDA initialization: Unexpected error from cudaGetDeviceCount(). Did you run some cuda functions before calling NumCudaDevices() that might have already set an error? Error 803: system has unsupported display driver / cuda driver combination (Triggered internally at /mount/work-dir/torch-2.4.1/torch-2.4.1/c10/cuda/CUDAFunctions.cpp:108.)
        return torch._C._cuda_getDeviceCount() > 0
      Platform:
        sys.version: 3.11.7 (main, Oct  9 2024, 00:00:00) [GCC 11.4.1 20231218 (Red Hat 11.4.1-3)]
        sys.platform: linux
        os.name: posix
        platform.release: 5.14.0-427.42.1.el9_4.x86_64
        platform.machine: x86_64
        platform.node: nvd-srv-30.nvidia.eng.rdu2.dc.redhat.com
        platform.python_version: 3.11.7
        os-release.ID: rhel
        os-release.VERSION_ID: 9.4
        os-release.PRETTY_NAME: Red Hat Enterprise Linux 9.4 (Plow)
        memory.total: 250.89 GB
        memory.available: 239.19 GB
        memory.used: 9.92 GB
      InstructLab:
        instructlab.version: 0.21.0
        instructlab-dolomite.version: 0.2.0
        instructlab-eval.version: 0.4.1
        instructlab-quantize.version: 0.1.0
        instructlab-schema.version: 0.4.1
        instructlab-sdg.version: 0.6.0
        instructlab-training.version: 0.6.1
      Torch:
        torch.version: 2.4.1
        torch.backends.cpu.capability: AVX512
        torch.version.cuda: 12.4
        torch.version.hip: None
        torch.cuda.available: False
        torch.backends.cuda.is_built: True
        torch.backends.mps.is_built: False
        torch.backends.mps.is_available: False
      llama_cpp_python:
        llama_cpp_python.version: 0.2.79
        llama_cpp_python.supports_gpu_offload: True 

      I haven't tried the other ilab commands yet, but I expect some of them to behave similarly.

      [cloud-user@nvd-srv-30 ~]$ podman images --format json | jq .[0]
      {
        "Id": "d220151f422a897870ab6e5c2ef7f1f41269a9cbf360412ee92274e89f1736f3",
        "ParentId": "",
        "RepoTags": null,
        "RepoDigests": [
          "quay.io/redhat-user-workloads/rhel-ai-tenant/instructlab-nvidia-1-3/instructlab-nvidia-1-3@sha256:280f035175cf708f18daba9a1475643b4e7784c87a7341be44e9d2e63e970616",
          "registry.stage.redhat.io/rhelai1/instructlab-nvidia-rhel9@sha256:280f035175cf708f18daba9a1475643b4e7784c87a7341be44e9d2e63e970616"
        ],
        "Size": 17739606765,
        "SharedSize": 0,
        "VirtualSize": 17739606765,
        "Labels": {
          "WHEEL_RELEASE": "v1.3.1031+rhelai-cuda-ubi9",
          "architecture": "x86_64",
          "build-date": "2024-11-20T10:44:15",
          "com.redhat.component": "ubi9-container",
          "com.redhat.license_terms": "https://www.redhat.com/en/about/red-hat-end-user-license-agreements#UBI",
          "description": "The Universal Base Image is designed and engineered to be the base layer for all of your containerized applications, middleware and utilities. This base image is freely redistributable, but Red Hat only supports Red Hat technologies through subscriptions for Red Hat products. This image is maintained by Red Hat and updated regularly.",
          "distribution-scope": "public",
          "io.buildah.version": "1.38.0-dev",
          "io.k8s.description": "The Universal Base Image is designed and engineered to be the base layer for all of your containerized applications, middleware and utilities. This base image is freely redistributable, but Red Hat only supports Red Hat technologies through subscriptions for Red Hat products. This image is maintained by Red Hat and updated regularly.",
          "io.k8s.display-name": "Red Hat Universal Base Image 9",
          "io.openshift.expose-services": "",
          "io.openshift.tags": "base rhel9",
          "maintainer": "Red Hat, Inc.",
          "name": "ubi9",
          "org.opencontainers.image.vendor": "Red Hat, Inc.",
          "release": "1214.1729773476",
          "summary": "Provides the latest release of Red Hat Universal Base Image 9.",
          "url": "https://access.redhat.com/containers/#/registry.access.redhat.com/ubi9/images/9.4-1214.1729773476",
          "vcs-ref": "cc63f67f3cace7a256ee1a8fefafbb37a9d07f05",
          "vcs-type": "git",
          "vendor": "Red Hat, Inc.",
          "version": "9.4"
        },
        "Containers": 0,
        "Names": [
          "registry.stage.redhat.io/rhelai1/instructlab-nvidia-rhel9:1.3-1730980057",
          "quay.io/redhat-user-workloads/rhel-ai-tenant/instructlab-nvidia-1-3/instructlab-nvidia-1-3:cc63f67f3cace7a256ee1a8fefafbb37a9d07f05"
        ],
        "Digest": "sha256:280f035175cf708f18daba9a1475643b4e7784c87a7341be44e9d2e63e970616",
        "History": [
          "quay.io/redhat-user-workloads/rhel-ai-tenant/instructlab-nvidia-1-3/instructlab-nvidia-1-3:cc63f67f3cace7a256ee1a8fefafbb37a9d07f05",
          "registry.stage.redhat.io/rhelai1/instructlab-nvidia-rhel9:1.3-1730980057"
        ],
        "Created": 1732100899,
        "CreatedAt": "2024-11-20T11:08:19Z"
      }
      [cloud-user@nvd-srv-30 ~]$ sudo bootc status
      apiVersion: org.containers.bootc/v1alpha1
      kind: BootcHost
      metadata:
        name: host
      spec:
        image:
          image: registry.redhat.io/rhelai1/bootc-nvidia-rhel9:1.1
          transport: registry
        bootOrder: default
      status:
        staged: null
        booted:
          image:
            image:
              image: registry.redhat.io/rhelai1/bootc-nvidia-rhel9:1.1
              transport: registry
            version: 9.20241104.0
            timestamp: null
            imageDigest: sha256:34cb007e44dc3a3e5c98a07d02b557ff37b1bdacc39a1d0aad25db339b6624ee
          cachedUpdate: null
          incompatible: false
          pinned: false
          store: ostreeContainer
          ostree:
            checksum: 4ac3ef66fe4957df890d742ae1bdf2a54affc62bce074b689d4b8c3389d4e2f4
            deploySerial: 0
        rollback: null
        rollbackQueued: false
        type: bootcHost 

      The host is a Dell R760xa.

      [cloud-user@nvd-srv-30 ~]$ nvidia-smi
      Fri Nov 22 20:21:25 2024
      +-----------------------------------------------------------------------------------------+
      | NVIDIA-SMI 550.127.05             Driver Version: 550.127.05     CUDA Version: 12.4     |
      |-----------------------------------------+------------------------+----------------------+
      | GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
      | Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
      |                                         |                        |               MIG M. |
      |=========================================+========================+======================|
      |   0  NVIDIA L40S                    On  |   00000000:4A:00.0 Off |                    0 |
      | N/A   27C    P8             30W /  350W |       1MiB /  46068MiB |      0%      Default |
      |                                         |                        |                  N/A |
      +-----------------------------------------+------------------------+----------------------+
      |   1  NVIDIA L40S                    On  |   00000000:61:00.0 Off |                    0 |
      | N/A   26C    P8             30W /  350W |       1MiB /  46068MiB |      0%      Default |
      |                                         |                        |                  N/A |
      +-----------------------------------------+------------------------+----------------------+
      |   2  NVIDIA L40S                    On  |   00000000:CA:00.0 Off |                    0 |
      | N/A   25C    P8             31W /  350W |       1MiB /  46068MiB |      0%      Default |
      |                                         |                        |                  N/A |
      +-----------------------------------------+------------------------+----------------------+
      |   3  NVIDIA L40S                    On  |   00000000:E1:00.0 Off |                    0 |
      | N/A   23C    P8             19W /  350W |       1MiB /  46068MiB |      0%      Default |
      |                                         |                        |                  N/A |
      +-----------------------------------------+------------------------+----------------------+
      +-----------------------------------------------------------------------------------------+
      | Processes:                                                                              |
      |  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
      |        ID   ID                                                               Usage      |
      |=========================================================================================|
      |  No running processes found                                                             |
      +-----------------------------------------------------------------------------------------+ 
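
      CUDA error 803 ("system has unsupported display driver / cuda driver combination") typically means the CUDA user-space stack in use requires a different host driver version than the one installed, even when nvidia-smi itself works. A minimal sketch of that version comparison (the helper names and the minimum-driver value are illustrative assumptions, not part of ilab or torch):

```python
# Hypothetical helper: compare NVIDIA driver versions as dotted-integer tuples.
# The minimum-driver value used below is an assumed example; consult NVIDIA's
# release notes for the actual requirement of a given CUDA toolkit.

def parse_version(version: str) -> tuple[int, ...]:
    """Turn a dotted version string like '550.127.05' into (550, 127, 5)."""
    return tuple(int(part) for part in version.split("."))

def driver_supports_cuda(driver_version: str, min_driver_version: str) -> bool:
    """True if the installed driver meets the minimum the CUDA stack needs."""
    return parse_version(driver_version) >= parse_version(min_driver_version)

# Driver version taken from the nvidia-smi output above; the minimum is assumed.
print(driver_supports_cuda("550.127.05", "550.54.14"))
```

      If a check like this fails, the usual fix is aligning the driver shipped in the bootc image with the CUDA runtime inside the instructlab container; here the host driver looks current, which points at a mismatch with the container's CUDA libraries rather than an outdated host driver.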

        Fabien Dupont (fdupont@redhat.com)
        Ariel Opincaru (aopincar)
        Votes: 0
        Watchers: 12