Red Hat Enterprise Linux AI / RHELAI-2359

RHELAI 1.3 Nvidia: "error executing hook `/usr/bin/nvidia-cdi-hook` (exit code: 1)" on both IBM and GCP while running ilab

    • Critical
    • Proposed

      To Reproduce

      Steps to reproduce the behavior:

      1. Download the cloud images from Stage ( https://downloads.corp.stage.redhat.com/internal/product/932/ver=1.3/rhel---9/1.3/rhelai-1.3-for-rhel-9-x86_64-isos/x86_64/cs-downloads ):
        1. GCP: rhel-ai-nvidia-gcp-1.3-1732181423-x86_64.tar.gz
        2. IBM Cloud: rhel-ai-nvidia-ibm-1.3-1732180472-x86_64.qcow2
      2. Launch instances from those images in the respective clouds
      3. Connect to the machines and run `ilab --version`
      4. IBM Cloud:
        1.    booted:
              image:
                image:
                  image: registry.stage.redhat.io/rhelai1/bootc-ibm-nvidia-rhel9:1.3
                  transport: registry
                version: 9.20241019.0
                timestamp: null
                imageDigest: sha256:d206e78961c947e8681ad1302a5e4d80139adb07c6dc123705c2d2599d236cdf
        2. [cloud-user@ecosystem-qe-1l40s ~]$ cat /usr/bin/ilab | grep IMAGE_NAME
          IMAGE_NAME="registry.stage.redhat.io/rhelai1/instructlab-nvidia-rhel9:1.3-1730980057"
          
        3. [cloud-user@ecosystem-qe-1l40s ~]$ ilab --version
          Error: OCI runtime error: crun: {"msg":"error executing hook `/usr/bin/nvidia-cdi-hook` (exit code: 1)","level":"error","time":"2024-11-22T14:07:25.690967Z"}
      5. GCP:
        1.  status:
            staged: null
            booted:
              image:
                image:
                  image: registry.stage.redhat.io/rhelai1/bootc-gcp-nvidia-rhel9:1.3
                  transport: registry
                version: 9.20241019.0
                timestamp: null
                imageDigest: sha256:57c9dc79f3fc2704d478a595748ebe3f2b37bf00633eac864717009f6e7859e4
              cachedUpdate: null
              incompatible: false
              pinned: false
              store: ostreeContainer
              ostree:
                checksum: 58e36e8c5a16e5afaa9b7a25c515180d383a25d2bfb2e12103637733e201c493
                deploySerial: 0
            rollback:
              image:
                image:
                  image: registry.stage.redhat.io/rhelai1/bootc-gcp-nvidia-rhel9:1.3
                  transport: registry
                version: 9.20241019.0
                timestamp: null
                imageDigest: sha256:32764676dda5a631c05582cbde0eff5777869f0931e61c8ea0730f70484d8240
              cachedUpdate:
        2. $ cat /usr/bin/ilab | grep IMAGE_NAME
          IMAGE_NAME="registry.stage.redhat.io/rhelai1/instructlab-nvidia-rhel9:1.3-1730980057"
        3. [cloud-user@ecosystem-qe bin]$ /usr/bin/ilab --version
          Error: OCI runtime error: crun: {"msg":"error executing hook `/usr/bin/nvidia-cdi-hook` (exit code: 1)","level":"error","time":"2024-11-22T14:13:07.709819Z"}
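
      When repeating these checks on several instances, the relevant fields can be extracted from the output with a couple of small shell helpers. This is only a sketch: `parse_hook_error` and `wrapper_image` are hypothetical helper names, not part of ilab, crun, or the NVIDIA tooling, and the patterns assume the error line and the `IMAGE_NAME="..."` assignment look exactly like the ones quoted above.

```shell
#!/usr/bin/env bash
# Hypothetical helpers for triaging the failure described in this issue.
# They only parse text; they do not start containers or change the system.

# Read crun error output on stdin and print "<hook path> <exit code>"
# for every line that reports a failed hook.
parse_hook_error() {
  sed -n 's/.*error executing hook `\([^`]*\)` (exit code: \([0-9]*\)).*/\1 \2/p'
}

# Print the value of the first IMAGE_NAME="..." assignment in a wrapper
# script. Defaults to /usr/bin/ilab, the path used in the steps above.
wrapper_image() {
  sed -n 's/^[[:space:]]*IMAGE_NAME="\([^"]*\)".*/\1/p' "${1:-/usr/bin/ilab}" | head -n 1
}

# Example usage on an affected instance (left commented out on purpose):
#   ilab --version 2>&1 | parse_hook_error
#   wrapper_image
```

      Running both helpers on the IBM Cloud and GCP instances gives one line per check that can be pasted into the issue and compared side by side.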

      Expected behavior

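      • `ilab --version` prints the InstructLab version on both the IBM Cloud and GCP instances, without the OCI runtime error.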

      Device Info (please complete the following information):

      • Hardware Specs: [e.g. Apple M2 Pro Chip, 16 GB Memory, etc.]
      • OS Version: [e.g. Mac OS 14.4.1, Fedora Linux 40]
      • Python Version: [output of `python --version`]
      • InstructLab Version: [output of `ilab --version`]

              fdupont@redhat.com Fabien Dupont
              cvultur@redhat.com Constantin Daniel Vultur