Red Hat Enterprise Linux AI
RHELAI-3613

SDG fails against some markdown documents during chunking


    • Important

      To Reproduce

      Steps to reproduce the behavior:

      1. Deploy RHEL AI 1.4.x onto a server with enough resources to complete the SDG run, and initialize `ilab` correctly
      2. Clone a taxonomy with reference documents in markdown with tables, such as https://github.com/jharmison-redhat/etx-bofa
      3. Run `ilab data generate` with that taxonomy either in `~/.local/share/instructlab/taxonomy` or referenced on the command line.
      4. Observe that SDG fails during document processing/chunking with:

      failed to generate data with exception: list index out of range
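The error message points at an unguarded list index during chunking: the workaround below patches `chunkers.py` to require a non-empty `prov` (provenance) list on each document element. A minimal sketch of that failure mode, with a hypothetical element structure and helper name (not the actual chunkers.py code):

```python
# Sketch of the suspected failure: indexing into an element's "prov"
# list without checking that it is non-empty. Structure and names are
# hypothetical, modeled on the sed patch in the workaround below.
def page_of(book_element):
    # Unguarded access: raises IndexError when "prov" is an empty list.
    return book_element["prov"][0]["page"]

ok = {"prov": [{"page": 3}]}
bad = {"prov": []}  # e.g. a parsed markdown element with no provenance

print(page_of(ok))  # 3
try:
    page_of(bad)
except IndexError as e:
    # Surfaces to the user as: failed to generate data with exception: ...
    print(f"failed to generate data with exception: {e}")
```

Markdown tables appear to produce elements of this provenance-free kind, which would explain why only taxonomies with table-bearing reference documents trigger the crash.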

      Expected behavior

      • SDG pipelines run successfully

      Device Info (please complete the following information):

      • Hardware Specs: Dell XE9680 with 8xA100 GPUs
      • OS Version: RHEL AI 1.4.1
      • InstructLab Version: ilab, version 0.23.2
      • Provide the output of these two commands:
        • sudo bootc status --format json | jq .status.booted.image.image.image

      "registry.redhat.io/rhelai1/bootc-nvidia-rhel9:1.4"

        • ilab system info

      ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
      ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
      ggml_cuda_init: found 8 CUDA devices:
        Device 0: NVIDIA A100-SXM4-80GB, compute capability 8.0, VMM: yes
        Device 1: NVIDIA A100-SXM4-80GB, compute capability 8.0, VMM: yes
        Device 2: NVIDIA A100-SXM4-80GB, compute capability 8.0, VMM: yes
        Device 3: NVIDIA A100-SXM4-80GB, compute capability 8.0, VMM: yes
        Device 4: NVIDIA A100-SXM4-80GB, compute capability 8.0, VMM: yes
        Device 5: NVIDIA A100-SXM4-80GB, compute capability 8.0, VMM: yes
        Device 6: NVIDIA A100-SXM4-80GB, compute capability 8.0, VMM: yes
        Device 7: NVIDIA A100-SXM4-80GB, compute capability 8.0, VMM: yes
      Platform:
        sys.version: 3.11.7 (main, Jan 8 2025, 00:00:00) [GCC 11.4.1 20231218 (Red Hat 11.4.1-3)]
        sys.platform: linux
        os.name: posix
        platform.release: 5.14.0-427.50.2.el9_4.x86_64
        platform.machine: x86_64
        platform.node: rhelai-1
        platform.python_version: 3.11.7
        os-release.ID: rhel
        os-release.VERSION_ID: 9.4
        os-release.PRETTY_NAME: Red Hat Enterprise Linux 9.4 (Plow)
        memory.total: 503.39 GB
        memory.available: 422.36 GB
        memory.used: 29.59 GB
      InstructLab:
        instructlab.version: 0.23.2
        instructlab-dolomite.version: 0.2.0
        instructlab-eval.version: 0.5.1
        instructlab-quantize.version: 0.1.0
        instructlab-schema.version: 0.4.2
        instructlab-sdg.version: 0.7.1
        instructlab-training.version: 0.7.0
      Torch:
        torch.version: 2.5.1
        torch.backends.cpu.capability: AVX512
        torch.version.cuda: 12.4
        torch.version.hip: None
        torch.cuda.available: True
        torch.backends.cuda.is_built: True
        torch.backends.mps.is_built: False
        torch.backends.mps.is_available: False
        torch.cuda.bf16: True
        torch.cuda.current.device: 0
        torch.cuda.0.name: NVIDIA A100-SXM4-80GB
        torch.cuda.0.free: 8.8 GB
        torch.cuda.0.total: 79.1 GB
        torch.cuda.0.capability: 8.0 (see https://developer.nvidia.com/cuda-gpus#compute)
        torch.cuda.1.name: NVIDIA A100-SXM4-80GB
        torch.cuda.1.free: 9.6 GB
        torch.cuda.1.total: 79.1 GB
        torch.cuda.1.capability: 8.0 (see https://developer.nvidia.com/cuda-gpus#compute)
        torch.cuda.2.name: NVIDIA A100-SXM4-80GB
        torch.cuda.2.free: 9.6 GB
        torch.cuda.2.total: 79.1 GB
        torch.cuda.2.capability: 8.0 (see https://developer.nvidia.com/cuda-gpus#compute)
        torch.cuda.3.name: NVIDIA A100-SXM4-80GB
        torch.cuda.3.free: 9.6 GB
        torch.cuda.3.total: 79.1 GB
        torch.cuda.3.capability: 8.0 (see https://developer.nvidia.com/cuda-gpus#compute)
        torch.cuda.4.name: NVIDIA A100-SXM4-80GB
        torch.cuda.4.free: 9.6 GB
        torch.cuda.4.total: 79.1 GB
        torch.cuda.4.capability: 8.0 (see https://developer.nvidia.com/cuda-gpus#compute)
        torch.cuda.5.name: NVIDIA A100-SXM4-80GB
        torch.cuda.5.free: 9.6 GB
        torch.cuda.5.total: 79.1 GB
        torch.cuda.5.capability: 8.0 (see https://developer.nvidia.com/cuda-gpus#compute)
        torch.cuda.6.name: NVIDIA A100-SXM4-80GB
        torch.cuda.6.free: 9.6 GB
        torch.cuda.6.total: 79.1 GB
        torch.cuda.6.capability: 8.0 (see https://developer.nvidia.com/cuda-gpus#compute)
        torch.cuda.7.name: NVIDIA A100-SXM4-80GB
        torch.cuda.7.free: 9.9 GB
        torch.cuda.7.total: 79.1 GB
        torch.cuda.7.capability: 8.0 (see https://developer.nvidia.com/cuda-gpus#compute)
      llama_cpp_python:
        llama_cpp_python.version: 0.3.2
        llama_cpp_python.supports_gpu_offload: True

      Bug impact

      • Unable to complete SDG and proceed with training

      Known workaround

      • $ cat Containerfile
        FROM registry.redhat.io/rhelai1/instructlab-nvidia-rhel9:1.4.1-1739870750
        RUN sed -i '363s/book_element/book_element and book_element["prov"]/' /opt/app-root/lib/python3.11/site-packages/instructlab/sdg/utils/chunkers.py
        $ sudo bootc usroverlay
        Development mode enabled.  A writable overlayfs is now mounted on /usr.
        All changes there will be discarded on reboot.
        $ podman build . -t localhost/instructlab-nvidia-rhel9:FIXED
        [snipped]
        $ sudo sed -i 's|^IMAGE_NAME=.*$|IMAGE_NAME=localhost/instructlab-nvidia-rhel9:FIXED|' /usr/bin/ilab
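The sed line in the Containerfile above tightens the element check in `chunkers.py`. In Python terms, the change amounts to the following guard (a sketch with a hypothetical helper name, not the actual chunkers.py source):

```python
# Sketch of the condition the workaround patches in
# instructlab/sdg/utils/chunkers.py (helper name hypothetical).
def has_usable_prov(book_element):
    # Before the patch the code checked only the element's truthiness:
    #     if book_element: ...
    # After the patch it also requires a non-empty "prov" list, so
    # provenance-free elements are skipped instead of triggering an
    # IndexError when "prov"[0] is accessed later.
    return bool(book_element and book_element["prov"])

print(has_usable_prov({"prov": [{"page": 1}]}))  # True
print(has_usable_prov({"prov": []}))             # False
```

Note that `bootc usroverlay` makes `/usr` writable only until the next reboot, so both the rebuilt image and the edited `/usr/bin/ilab` must be reapplied after rebooting.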

      Additional context

              People: Chris Chase (cchase@redhat.com), James Harmison (rhn-support-jharmiso), Eshwar Prasad Sivaramakrishnan