Type: Bug
Resolution: Won't Do
Priority: Undefined
Fix Version/s: None
Affects Version/s: rhelai-1.4.5
Component/s: InstructLab - Training, RHELAI - IBMCloud
Labels:
- 3.0-candidate

Blocked: False
Blocked Reason: None
Ready: False

To Reproduce

Steps to reproduce the behavior:

ilab train --strategy lab-multiphase

Expected behavior

Screenshots

Device Info (please complete the following information):

Hardware Specs: IBM Cloud 8 H100 GPU instance
OS Version: Red Hat Enterprise Linux 9.4 (Plow)
InstructLab Version: 0.23.5
Provide the output of these two commands:
- "registry.stage.redhat.io/rhelai1/bootc-nvidia-rhel9:1.4.5-1746558530"

- Device 7: NVIDIA H100 80GB HBM3, compute capability 9.0, VMM: yes

Platform:
  sys.version: 3.11.7 (main, Jan 8 2025, 00:00:00) [GCC 11.4.1 20231218 (Red Hat 11.4.1-3)]
  sys.platform: linux
  os.name: posix
  platform.release: 5.14.0-427.55.1.el9_4.x86_64
  platform.machine: x86_64
  platform.node: ecosystem-qe-h100-de-145
  platform.python_version: 3.11.7
  os-release.ID: rhel
  os-release.VERSION_ID: 9.4
  os-release.PRETTY_NAME: Red Hat Enterprise Linux 9.4 (Plow)
  memory.total: 1763.83 GB
  memory.available: 1751.75 GB
  memory.used: 3.67 GB

InstructLab:
  instructlab.version: 0.23.5
  instructlab-dolomite.version: 0.2.0
  instructlab-eval.version: 0.5.1
  instructlab-quantize.version: 0.1.0
  instructlab-schema.version: 0.4.2
  instructlab-sdg.version: 0.7.3
  instructlab-training.version: 0.7.0

Torch:
  torch.version: 2.5.1
  torch.backends.cpu.capability: AVX512
  torch.version.cuda: 12.4
  torch.version.hip: None
  torch.cuda.available: True
  torch.backends.cuda.is_built: True
  torch.backends.mps.is_built: False
  torch.backends.mps.is_available: False
  torch.cuda.bf16: True
  torch.cuda.current.device: 0
  torch.cuda.0.name: NVIDIA H100 80GB HBM3
  torch.cuda.0.free: 78.6 GB
  torch.cuda.0.total: 79.1 GB
  torch.cuda.0.capability: 9.0 (see https://developer.nvidia.com/cuda-gpus#compute)
  torch.cuda.1.name: NVIDIA H100 80GB HBM3
  torch.cuda.1.free: 78.6 GB
  torch.cuda.1.total: 79.1 GB
  torch.cuda.1.capability: 9.0 (see https://developer.nvidia.com/cuda-gpus#compute)
  torch.cuda.2.name: NVIDIA H100 80GB HBM3
  torch.cuda.2.free: 78.6 GB
  torch.cuda.2.total: 79.1 GB
  torch.cuda.2.capability: 9.0 (see https://developer.nvidia.com/cuda-gpus#compute)
  torch.cuda.3.name: NVIDIA H100 80GB HBM3
  torch.cuda.3.free: 78.6 GB
  torch.cuda.3.total: 79.1 GB
  torch.cuda.3.capability: 9.0 (see https://developer.nvidia.com/cuda-gpus#compute)
  torch.cuda.4.name: NVIDIA H100 80GB HBM3
  torch.cuda.4.free: 78.6 GB
  torch.cuda.4.total: 79.1 GB
  torch.cuda.4.capability: 9.0 (see https://developer.nvidia.com/cuda-gpus#compute)
  torch.cuda.5.name: NVIDIA H100 80GB HBM3
  torch.cuda.5.free: 78.6 GB
  torch.cuda.5.total: 79.1 GB
  torch.cuda.5.capability: 9.0 (see https://developer.nvidia.com/cuda-gpus#compute)
  torch.cuda.6.name: NVIDIA H100 80GB HBM3
  torch.cuda.6.free: 78.6 GB
  torch.cuda.6.total: 79.1 GB
  torch.cuda.6.capability: 9.0 (see https://developer.nvidia.com/cuda-gpus#compute)
  torch.cuda.7.name: NVIDIA H100 80GB HBM3
  torch.cuda.7.free: 78.6 GB
  torch.cuda.7.total: 79.1 GB
  torch.cuda.7.capability: 9.0 (see https://developer.nvidia.com/cuda-gpus#compute)

llama_cpp_python:
  llama_cpp_python.version: 0.3.2
  llama_cpp_python.supports_gpu_offload: True

Bug impact

Please provide information on the impact of this bug to the end user.

Known workaround

Please add any known workarounds.

Additional context



Attachments:
- Screenshot 2025-05-08 at 7.24.51 PM.png (897 kB, 2025/06/23 2:24 PM)
- Screenshot 2025-05-08 at 7.25.25 PM.png (527 kB, 2025/06/23 2:24 PM)
- Screenshot 2025-05-08 at 7.26.00 PM.png (478 kB, 2025/06/23 2:24 PM)
- Screenshot 2025-05-08 at 7.26.12 PM.png (483 kB, 2025/06/23 2:24 PM)

Issue Links:
- clones: RHELAI-4118 [Documented Workaround] RHEL AI 1.4.5-1 Training fails: Watchdog caught collective operation timeout (Closed)
- is duplicated by: RHELAI-4087 distributed checkpoint saving (model and optimizer) take more time than expected in `instructlab/training` (Closed)
- links to: GH issue 3404 "use 512MB/s throughput on GHA CI runners in AWS"

Assignee: Unassigned
Reporter: Justin Larkin
Votes: 0
Watchers: 1
Created: 2025/06/23 2:24 PM
Updated: 2026/01/05 12:52 PM
Resolved: 2026/01/05 12:52 PM
