Red Hat Enterprise Linux AI / RHELAI-4052

ilab model serve failed on Intel Gaudi with "Specified --gpus value (8) exceeds available GPUs (0)."

    • Approved

      To Reproduce

      Steps to reproduce the behavior:

      Run: ilab model serve --model-path ~/.cache/instructlab/models/granite-8b-lab-v1

      [root@g3-srv15-c03b-idc ~]# ilab model serve --model-path ~/.cache/instructlab/models/granite-8b-lab-v1
      INFO 2025-05-02 16:36:38,799 instructlab.model.serve_backend:79: Setting backend_type in the serve config to vllm
      INFO 2025-05-02 16:36:38,816 instructlab.model.serve_backend:85: Using model '/root/.cache/instructlab/models/granite-8b-lab-v1' with -1 gpu-layers and 4096 max context size.
      ERROR 2025-05-02 16:36:38,817 instructlab.model.serve_backend:120: Specified --gpus value (8) exceeds available GPUs (0).
      Please specify a valid number of GPUs.
      [root@g3-srv15-c03b-idc ~]#
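The error text implies that the configured GPU count (8) is compared against a detected device count (0). The following is a minimal sketch of such a check, not the actual instructlab code. The function name `check_gpu_request` and both probe functions are hypothetical and exist only to illustrate how a probe that does not see the Gaudi accelerators would produce this exact message.

```python
from typing import Callable


def check_gpu_request(requested: int, count_devices: Callable[[], int]) -> int:
    """Return `requested` if the probe reports at least that many devices.

    Hypothetical helper: the real check in instructlab may be structured
    differently. Only the message text is taken from the log above.
    """
    available = count_devices()
    if requested > available:
        raise ValueError(
            f"Specified --gpus value ({requested}) exceeds available GPUs ({available}).\n"
            "Please specify a valid number of GPUs."
        )
    return requested


# Hypothetical probes. One does not see the accelerators at all and reports 0;
# the other reports the 8 accelerators listed under "Hardware Specs".
def probe_without_gaudi_support() -> int:
    return 0


def probe_with_gaudi_support() -> int:
    return 8


if __name__ == "__main__":
    print(check_gpu_request(8, probe_with_gaudi_support))
    try:
        check_gpu_request(8, probe_without_gaudi_support)
    except ValueError as exc:
        print(exc)
```

If the real check behaves like this sketch, the reported count of 0 points at device detection rather than at the configured value of 8, since the attached hl-smi output lists 8 accelerators.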

      Device Info (please complete the following information):

      • Hardware Specs: Intel Gaudi 3 Server (Intel SDP Platform) with 8 accelerators (see attached hl-smi output)
      • OS Version: Red Hat Enterprise Linux 9.4 / RHEL AI 1.5
      • InstructLab Version: ilab, version 0.26.0a1
      • Provide the output of these two commands:
        • sudo bootc status --format json | jq .status.booted.image.image.image 

      [root@g3-srv15-c03b-idc ~]# sudo bootc status --format json | jq .status.booted.image.image.image
      "registry.stage.redhat.io/rhelai1/bootc-intel-rhel9:1.5-1746033450"
      [root@g3-srv15-c03b-idc ~]#

       

        • ilab system info to print detailed information about InstructLab version, OS, and hardware – including GPU / AI accelerator hardware

      -----------------------: System Configuration :------------------------
      Num CPU Cores : 224
      CPU RAM : 1056269984 KB
      ------------------------------------------------------------------------------
      Platform:
      sys.version: 3.11.7 (main, Jan 8 2025, 00:00:00) [GCC 11.4.1 20231218 (Red Hat 11.4.1-3)]
      sys.platform: linux
      os.name: posix
      platform.release: 5.14.0-427.62.1.el9_4.x86_64
      platform.machine: x86_64
      platform.node: g3-srv15-c03b-idc
      platform.python_version: 3.11.7
      os-release.ID: rhel
      os-release.VERSION_ID: 9.4
      os-release.PRETTY_NAME: Red Hat Enterprise Linux 9.4 (Plow)
      memory.total: 1007.34 GB
      memory.available: 993.65 GB
      memory.used: 8.85 GB
      See attached file for complete output
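The two memory figures in the output above (CPU RAM in KB and memory.total in GB) can be cross-checked with a short conversion. This sketch assumes both units are binary (1 GB = 1024 * 1024 KB), which is an assumption and not stated in the output; the helper name `kib_to_gib` is illustrative.

```python
KIB_PER_GIB = 1024 ** 2  # assumption: binary units


def kib_to_gib(kib: int) -> float:
    """Convert a size in KiB to GiB."""
    return kib / KIB_PER_GIB


cpu_ram_kb = 1056269984  # "CPU RAM" value from the System Configuration block
print(f"{kib_to_gib(cpu_ram_kb):.2f} GB")  # prints "1007.34 GB"
```

Under that assumption the result matches memory.total (1007.34 GB), so both lines describe the same amount of host memory.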

      Bug impact

      • Not able to serve a model

      Known workaround

      • None

      Additional context

      • The system was updated from RHEL AI 1.4 to RHEL AI 1.5

      bootc switch registry.stage.redhat.io/rhelai1/bootc-intel-rhel9:1.5-1746033

      • ilab config init was re-run

              cdoern@redhat.com Charles Doern
              brault@redhat.com Bertrand Rault
              Charles Doern