Type: Bug
Resolution: Not a Bug
Priority: Undefined
Fix Version/s: None
Affects Version/s: rhelai-1.4.3
Component/s: DevOps, InstructLab - Training, RHELAI - IBMCloud
Labels: None
Blocked: False
Blocked Reason: None
Ready: False
Intelligence Requested:
Market:
SFDC Cases Links:
SFDC Cases Open:
SFDC Cases Counter:

To Reproduce

Steps to reproduce the behavior:

Run short training with phased_num_epochs reduced to 2

Expected behavior

AssertionError: MultiQueryAttention should have 1 head for keys and values
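The assertion message above suggests a configuration check on attention head counts. The following is a minimal sketch of what such a check might look like, not the actual InstructLab or Dolomite code. The function name `check_kv_heads`, its parameters, and the `"mqa"`/`"gqa"`/`"mha"` labels are assumptions made for illustration; only the assertion message is taken from this report.

```python
def check_kv_heads(attention_head_type: str,
                   num_attention_heads: int,
                   num_key_value_heads: int) -> None:
    """Hypothetical validation of key/value head counts per attention type."""
    if attention_head_type == "mqa":
        # Multi-query attention shares a single key/value head across all query heads.
        assert num_key_value_heads == 1, (
            "MultiQueryAttention should have 1 head for keys and values"
        )
    elif attention_head_type == "mha":
        # Multi-head attention uses one key/value head per query head.
        assert num_key_value_heads == num_attention_heads
    elif attention_head_type == "gqa":
        # Grouped-query attention splits query heads evenly across key/value heads.
        assert num_attention_heads % num_key_value_heads == 0
    else:
        raise ValueError(f"unknown attention_head_type: {attention_head_type}")


# A model config labeled as multi-query but carrying 8 key/value heads
# would trip the same message as the one reported here.
try:
    check_kv_heads("mqa", num_attention_heads=32, num_key_value_heads=8)
except AssertionError as err:
    print(f"AssertionError: {err}")
```

If the real check is similar, the error would point to a mismatch between the attention type recorded in the model config and its key/value head count, rather than to the training loop itself.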

Screenshots

Device Info (please complete the following information):

Hardware Specs: IBM Cloud instance (8 H100 GPU)
OS Version: RHEL AI 1.4
InstructLab Version: 0.23.5
Provide the output of these two commands:
- sudo bootc status --format json | jq .status.booted.image.image.image to print the name and tag of the bootc image, should look like registry.stage.redhat.io/rhelai1/bootc-intel-rhel9:1.3-1732894187
  - "registry.redhat.io/rhelai1/bootc-nvidia-rhel9:1.4"

- ilab system info
  - platform.node: rhelai-qe-h100-br-sao
  - platform.python_version: 3.11.7
  - os-release.ID: rhel
  - os-release.VERSION_ID: 9.4
  - os-release.PRETTY_NAME: Red Hat Enterprise Linux 9.4 (Plow)
  - memory.total: 1763.83 GB
  - memory.available: 1752.49 GB
  - memory.used: 3.41 GB
  - InstructLab:
    - instructlab.version: 0.23.5
    - instructlab-dolomite.version: 0.2.0
    - instructlab-eval.version: 0.5.1
    - instructlab-quantize.version: 0.1.0
    - instructlab-schema.version: 0.4.2
    - instructlab-sdg.version: 0.7.2
    - instructlab-training.version: 0.7.0
  - Torch:
    - torch.version: 2.5.1
    - torch.backends.cpu.capability: AVX512
    - torch.version.cuda: 12.4
    - torch.version.hip: None
    - torch.cuda.available: True
    - torch.backends.cuda.is_built: True
    - torch.backends.mps.is_built: False
    - torch.backends.mps.is_available: False
    - torch.cuda.bf16: True
    - torch.cuda.current.device: 0
    - torch.cuda.0.name: NVIDIA H100 80GB HBM3
    - torch.cuda.0.free: 78.6 GB
    - torch.cuda.0.total: 79.1 GB
    - torch.cuda.0.capability: 9.0 (see https://developer.nvidia.com/cuda-gpus#compute)
    - torch.cuda.1.name: NVIDIA H100 80GB HBM3
    - torch.cuda.1.free: 78.6 GB
    - torch.cuda.1.total: 79.1 GB
    - torch.cuda.1.capability: 9.0 (see https://developer.nvidia.com/cuda-gpus#compute)
    - torch.cuda.2.name: NVIDIA H100 80GB HBM3
    - torch.cuda.2.free: 78.6 GB
    - torch.cuda.2.total: 79.1 GB
    - torch.cuda.2.capability: 9.0 (see https://developer.nvidia.com/cuda-gpus#compute)
    - torch.cuda.3.name: NVIDIA H100 80GB HBM3
    - torch.cuda.3.free: 78.6 GB
    - torch.cuda.3.total: 79.1 GB
    - torch.cuda.3.capability: 9.0 (see https://developer.nvidia.com/cuda-gpus#compute)
    - torch.cuda.4.name: NVIDIA H100 80GB HBM3
    - torch.cuda.4.free: 78.6 GB
    - torch.cuda.4.total: 79.1 GB
    - torch.cuda.4.capability: 9.0 (see https://developer.nvidia.com/cuda-gpus#compute)
    - torch.cuda.5.name: NVIDIA H100 80GB HBM3
    - torch.cuda.5.free: 78.6 GB
    - torch.cuda.5.total: 79.1 GB
    - torch.cuda.5.capability: 9.0 (see https://developer.nvidia.com/cuda-gpus#compute)
    - torch.cuda.6.name: NVIDIA H100 80GB HBM3
    - torch.cuda.6.free: 78.6 GB
    - torch.cuda.6.total: 79.1 GB
    - torch.cuda.6.capability: 9.0 (see https://developer.nvidia.com/cuda-gpus#compute)
    - torch.cuda.7.name: NVIDIA H100 80GB HBM3
    - torch.cuda.7.free: 78.6 GB
    - torch.cuda.7.total: 79.1 GB
    - torch.cuda.7.capability: 9.0 (see https://developer.nvidia.com/cuda-gpus#compute)
  - llama_cpp_python:
    - llama_cpp_python.version: 0.3.2
    - llama_cpp_python.supports_gpu_offload: True
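For scripted collection of the device info, the jq path quoted in the first command above (`.status.booted.image.image.image`) can be mirrored in Python. This is a minimal sketch: the helper name `booted_image` is hypothetical, and the sample JSON is a hand-built stand-in containing only the keys on that path, not real `bootc status` output.

```python
import json


def booted_image(status_json: str) -> str:
    """Return the booted image reference, following the same path as the jq filter."""
    status = json.loads(status_json)
    return status["status"]["booted"]["image"]["image"]["image"]


# Hypothetical, trimmed-down input using the image name reported in this issue.
sample = json.dumps({
    "status": {
        "booted": {
            "image": {
                "image": {
                    "image": "registry.redhat.io/rhelai1/bootc-nvidia-rhel9:1.4"
                }
            }
        }
    }
})

print(booted_image(sample))
```

On a real host, the input would presumably come from the output of `sudo bootc status --format json` rather than a literal string.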

Bug impact

Please provide information on the impact of this bug to the end user.

Known workaround

Please add any known workarounds.

Additional context



- Screenshot 2025-04-01 at 8.57.42 PM.png (482 kB, 2025/04/02 2:05 AM)
- Screenshot 2025-04-01 at 8.58.08 PM.png (517 kB, 2025/04/02 2:05 AM)
- Screenshot 2025-04-01 at 8.56.53 PM.png (689 kB, 2025/04/02 2:05 AM)

is related to

RHELAI-3623 RHEL AI 1.4.1: Multi-phase training fails with a distributed training error

Resolved

Assignee: James Kunstle (Inactive)
Reporter: Justin Larkin
QA Contact: Kamesh Akella
Votes: 0
Watchers: 2
Created: 2025/04/02 2:24 AM
Updated: 2025/04/02 6:41 PM
Resolved: 2025/04/02 6:41 PM
