-
Bug
-
Resolution: Unresolved
-
Critical
-
Release Notes
-
Known Issue
-
-
-
Rejected
Found on a Dell R760xa machine with 4x NVIDIA L40S GPUs.
The following error occurred during an SDG run:
failed to generate data with exception: PipelineBlockError(<class 'instructlab.sdg.llmblock.ConditionalLLMBlock'>/knowledge generation): Request timed out.
Machine Info:
Disk Image:
rhel-ai-nvidia-1.3-1732790129-x86_64-boot.iso
[cloud-user@nvd-srv-30 ~]$ nvidia-smi
Mon Dec 2 16:49:04 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.127.05 Driver Version: 550.127.05 CUDA Version: 12.4 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA L40S On | 00000000:4A:00.0 Off | 0 |
| N/A 54C P0 220W / 350W | 43601MiB / 46068MiB | 93% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
| 1 NVIDIA L40S On | 00000000:61:00.0 Off | 0 |
| N/A 54C P0 247W / 350W | 42871MiB / 46068MiB | 89% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
| 2 NVIDIA L40S On | 00000000:CA:00.0 Off | 0 |
| N/A 51C P0 220W / 350W | 42871MiB / 46068MiB | 91% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
| 3 NVIDIA L40S On | 00000000:E1:00.0 Off | 0 |
| N/A 50C P0 209W / 350W | 42871MiB / 46068MiB | 93% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| 0 N/A N/A 189062 C /opt/app-root/bin/python3.11 720MiB |
| 0 N/A N/A 189145 C /opt/app-root/bin/python3.11 42858MiB |
| 1 N/A N/A 189212 C /opt/app-root/bin/python3.11 42852MiB |
| 2 N/A N/A 189213 C /opt/app-root/bin/python3.11 42852MiB |
| 3 N/A N/A 189214 C /opt/app-root/bin/python3.11 42852MiB |
+-----------------------------------------------------------------------------------------+
[cloud-user@nvd-srv-30 ~]$ sudo bootc status
apiVersion: org.containers.bootc/v1alpha1
kind: BootcHost
metadata:
  name: host
spec:
  image:
    image: registry.stage.redhat.io/rhelai1/bootc-nvidia-rhel9:1.3
    transport: registry
  bootOrder: default
status:
  staged: null
  booted:
    image:
      image:
        image: registry.stage.redhat.io/rhelai1/bootc-nvidia-rhel9:1.3
        transport: registry
      version: 9.20241104.0
      timestamp: null
      imageDigest: sha256:9997ada9611ce65d18e5122eaccc7f6eb034b81856bddb6f731170e9c1137936
    cachedUpdate: null
    incompatible: false
    pinned: false
    store: ostreeContainer
    ostree:
      checksum: 7d39505627d14978efa6879c9f29471e16ff1218d01ffc70667896b325cf0c7b
      deploySerial: 0
  rollback: null
  rollbackQueued: false
  type: bootcHost
[cloud-user@nvd-srv-30 ~]$ sudo podman images --format json
[
  {
    "Id": "19f1e146bd0a1000434952b23a806e926be6a505ff614461df812fe5d7b16d97",
    "ParentId": "",
    "RepoTags": null,
    "RepoDigests": [
      "registry.stage.redhat.io/rhelai1/instructlab-nvidia-rhel9@sha256:56f6f1febaf23ec7a69cb0b57a065ef5cf8df80108255c981b73508543aba22d"
    ],
    "Size": 18205357805,
    "SharedSize": 0,
    "VirtualSize": 18205357805,
    "Labels": {
      "WHEEL_RELEASE": "v1.3.1103+rhelai-cuda-ubi9",
      "architecture": "x86_64",
      "build-date": "2024-11-28T00:43:39",
      "com.redhat.component": "ubi9-container",
      "com.redhat.license_terms": "https://www.redhat.com/en/about/red-hat-end-user-license-agreements#UBI",
      "description": "The Universal Base Image is designed and engineered to be the base layer for all of your containerized applications, middleware and utilities. This base image is freely redistributable, but Red Hat only supports Red Hat technologies through subscriptions for Red Hat products. This image is maintained by Red Hat and updated regularly.",
      "distribution-scope": "public",
      "io.buildah.version": "1.38.0-dev",
      "io.k8s.description": "The Universal Base Image is designed and engineered to be the base layer for all of your containerized applications, middleware and utilities. This base image is freely redistributable, but Red Hat only supports Red Hat technologies through subscriptions for Red Hat products. This image is maintained by Red Hat and updated regularly.",
      "io.k8s.display-name": "Red Hat Universal Base Image 9",
      "io.openshift.expose-services": "",
      "io.openshift.tags": "base rhel9",
      "maintainer": "Red Hat, Inc.",
      "name": "ubi9",
      "org.opencontainers.image.vendor": "Red Hat, Inc.",
      "release": "1214.1729773476",
      "summary": "Provides the latest release of Red Hat Universal Base Image 9.",
      "url": "https://access.redhat.com/containers/#/registry.access.redhat.com/ubi9/images/9.4-1214.1729773476",
      "vcs-ref": "91f8ea81d7dacb5d4bee8106fca510214d37ecf5",
      "vcs-type": "git",
      "vendor": "Red Hat, Inc.",
      "version": "9.4"
    },
    "Containers": 1,
    "ReadOnly": true,
    "Names": [
      "registry.stage.redhat.io/rhelai1/instructlab-nvidia-rhel9:1.3-1732754619"
    ],
    "Digest": "sha256:56f6f1febaf23ec7a69cb0b57a065ef5cf8df80108255c981b73508543aba22d",
    "History": [
      "registry.stage.redhat.io/rhelai1/instructlab-nvidia-rhel9:1.3-1732754619"
    ],
    "Created": 1732755754,
    "CreatedAt": "2024-11-28T01:02:34Z"
  }
]
- is duplicated by
RHELAI-2490 (Closed): SDG fails on g6. `failed to generate data with exception: PipelineBlockError(<class 'instructlab.sdg.llmblock.ConditionalLLMBlock'>/knowledge generation): Request timed out.`