Red Hat Enterprise Linux AI / RHELAI-2481

SDG fails on some lower-end supported hardware profiles

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Critical
    • rhelai-1.3
    • None
    • InstructLab - SDG
    • None
    • False
    • None
    • False
    • Release Notes
    • Known Issue
    • Rejected

      Found on a Dell R760xa machine with 4x NVIDIA L40S GPUs.

      The following error occurred during the SDG run:

      failed to generate data with exception: PipelineBlockError(<class 'instructlab.sdg.llmblock.ConditionalLLMBlock'>/knowledge generation): Request timed out. 

      Machine Info:

      Disk Image:

      rhel-ai-nvidia-1.3-1732790129-x86_64-boot.iso 
      [cloud-user@nvd-srv-30 ~]$ nvidia-smi
      Mon Dec  2 16:49:04 2024
      +-----------------------------------------------------------------------------------------+
      | NVIDIA-SMI 550.127.05             Driver Version: 550.127.05     CUDA Version: 12.4     |
      |-----------------------------------------+------------------------+----------------------+
      | GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
      | Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
      |                                         |                        |               MIG M. |
      |=========================================+========================+======================|
      |   0  NVIDIA L40S                    On  |   00000000:4A:00.0 Off |                    0 |
      | N/A   54C    P0            220W /  350W |   43601MiB /  46068MiB |     93%      Default |
      |                                         |                        |                  N/A |
      +-----------------------------------------+------------------------+----------------------+
      |   1  NVIDIA L40S                    On  |   00000000:61:00.0 Off |                    0 |
      | N/A   54C    P0            247W /  350W |   42871MiB /  46068MiB |     89%      Default |
      |                                         |                        |                  N/A |
      +-----------------------------------------+------------------------+----------------------+
      |   2  NVIDIA L40S                    On  |   00000000:CA:00.0 Off |                    0 |
      | N/A   51C    P0            220W /  350W |   42871MiB /  46068MiB |     91%      Default |
      |                                         |                        |                  N/A |
      +-----------------------------------------+------------------------+----------------------+
      |   3  NVIDIA L40S                    On  |   00000000:E1:00.0 Off |                    0 |
      | N/A   50C    P0            209W /  350W |   42871MiB /  46068MiB |     93%      Default |
      |                                         |                        |                  N/A |
      +-----------------------------------------+------------------------+----------------------+

      +-----------------------------------------------------------------------------------------+
      | Processes:                                                                              |
      |  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
      |        ID   ID                                                               Usage      |
      |=========================================================================================|
      |    0   N/A  N/A    189062      C   /opt/app-root/bin/python3.11                  720MiB |
      |    0   N/A  N/A    189145      C   /opt/app-root/bin/python3.11                42858MiB |
      |    1   N/A  N/A    189212      C   /opt/app-root/bin/python3.11                42852MiB |
      |    2   N/A  N/A    189213      C   /opt/app-root/bin/python3.11                42852MiB |
      |    3   N/A  N/A    189214      C   /opt/app-root/bin/python3.11                42852MiB |
      +-----------------------------------------------------------------------------------------+ 
      [cloud-user@nvd-srv-30 ~]$ sudo bootc status
      apiVersion: org.containers.bootc/v1alpha1
      kind: BootcHost
      metadata:
        name: host
      spec:
        image:
          image: registry.stage.redhat.io/rhelai1/bootc-nvidia-rhel9:1.3
          transport: registry
        bootOrder: default
      status:
        staged: null
        booted:
          image:
            image:
              image: registry.stage.redhat.io/rhelai1/bootc-nvidia-rhel9:1.3
              transport: registry
            version: 9.20241104.0
            timestamp: null
            imageDigest: sha256:9997ada9611ce65d18e5122eaccc7f6eb034b81856bddb6f731170e9c1137936
          cachedUpdate: null
          incompatible: false
          pinned: false
          store: ostreeContainer
          ostree:
            checksum: 7d39505627d14978efa6879c9f29471e16ff1218d01ffc70667896b325cf0c7b
            deploySerial: 0
        rollback: null
        rollbackQueued: false
        type: bootcHost 
      [cloud-user@nvd-srv-30 ~]$ sudo podman images --format json
      [
          {
              "Id": "19f1e146bd0a1000434952b23a806e926be6a505ff614461df812fe5d7b16d97",
              "ParentId": "",
              "RepoTags": null,
              "RepoDigests": [
                  "registry.stage.redhat.io/rhelai1/instructlab-nvidia-rhel9@sha256:56f6f1febaf23ec7a69cb0b57a065ef5cf8df80108255c981b73508543aba22d"
              ],
              "Size": 18205357805,
              "SharedSize": 0,
              "VirtualSize": 18205357805,
              "Labels": {
                  "WHEEL_RELEASE": "v1.3.1103+rhelai-cuda-ubi9",
                  "architecture": "x86_64",
                  "build-date": "2024-11-28T00:43:39",
                  "com.redhat.component": "ubi9-container",
                  "com.redhat.license_terms": "https://www.redhat.com/en/about/red-hat-end-user-license-agreements#UBI",
                  "description": "The Universal Base Image is designed and engineered to be the base layer for all of your containerized applications, middleware and utilities. This base image is freely redistributable, but Red Hat only supports Red Hat technologies through subscriptions for Red Hat products. This image is maintained by Red Hat and updated regularly.",
                  "distribution-scope": "public",
                  "io.buildah.version": "1.38.0-dev",
                  "io.k8s.description": "The Universal Base Image is designed and engineered to be the base layer for all of your containerized applications, middleware and utilities. This base image is freely redistributable, but Red Hat only supports Red Hat technologies through subscriptions for Red Hat products. This image is maintained by Red Hat and updated regularly.",
                  "io.k8s.display-name": "Red Hat Universal Base Image 9",
                  "io.openshift.expose-services": "",
                  "io.openshift.tags": "base rhel9",
                  "maintainer": "Red Hat, Inc.",
                  "name": "ubi9",
                  "org.opencontainers.image.vendor": "Red Hat, Inc.",
                  "release": "1214.1729773476",
                  "summary": "Provides the latest release of Red Hat Universal Base Image 9.",
                  "url": "https://access.redhat.com/containers/#/registry.access.redhat.com/ubi9/images/9.4-1214.1729773476",
                  "vcs-ref": "91f8ea81d7dacb5d4bee8106fca510214d37ecf5",
                  "vcs-type": "git",
                  "vendor": "Red Hat, Inc.",
                  "version": "9.4"
              },
              "Containers": 1,
              "ReadOnly": true,
              "Names": [
                  "registry.stage.redhat.io/rhelai1/instructlab-nvidia-rhel9:1.3-1732754619"
              ],
              "Digest": "sha256:56f6f1febaf23ec7a69cb0b57a065ef5cf8df80108255c981b73508543aba22d",
              "History": [
                  "registry.stage.redhat.io/rhelai1/instructlab-nvidia-rhel9:1.3-1732754619"
              ],
              "Created": 1732755754,
              "CreatedAt": "2024-11-28T01:02:34Z"
          }
      ] 
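      A minimal sketch for reading the nvidia-smi output above: it takes the four memory-usage rows copied from this report and computes how much memory each GPU has left. The constant and function names are illustrative and not part of InstructLab or any NVIDIA tool. The numbers only show that each GPU had roughly 2.4 to 3.2 GiB free at the time of the snapshot; this may simply reflect preallocation by the serving process and does not by itself explain the timeout.

```python
import re

# Memory-usage rows copied verbatim from the nvidia-smi output in this report.
NVIDIA_SMI_ROWS = """\
| N/A   54C    P0            220W /  350W |   43601MiB /  46068MiB |     93%      Default |
| N/A   54C    P0            247W /  350W |   42871MiB /  46068MiB |     89%      Default |
| N/A   51C    P0            220W /  350W |   42871MiB /  46068MiB |     91%      Default |
| N/A   50C    P0            209W /  350W |   42871MiB /  46068MiB |     93%      Default |
"""

# Matches the "used / total" memory column, e.g. "43601MiB /  46068MiB".
MEM_PATTERN = re.compile(r"(\d+)MiB\s*/\s*(\d+)MiB")


def memory_headroom(rows: str) -> list[tuple[int, int, float]]:
    """Return (used_mib, free_mib, used_fraction) for each GPU row found."""
    result = []
    for line in rows.splitlines():
        match = MEM_PATTERN.search(line)
        if match is None:
            continue  # skip lines without a memory column
        used, total = int(match.group(1)), int(match.group(2))
        result.append((used, total - used, used / total))
    return result


if __name__ == "__main__":
    for gpu, (used, free, frac) in enumerate(memory_headroom(NVIDIA_SMI_ROWS)):
        print(f"GPU {gpu}: {used} MiB used, {free} MiB free ({frac:.1%} of capacity)")
```

      On a live system the same figures can be read without parsing the table, for example with nvidia-smi --query-gpu=memory.used,memory.total --format=csv.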

              bbrownin@redhat.com Ben Browning
              aopincar Ariel Opincaru
              Votes: 0
              Watchers: 4