- Bug
- Resolution: Cannot Reproduce
- Critical
- rhelai-1.5
To Reproduce
Steps to reproduce the behavior:
- Deploy RHEL AI 1.5.2 on a system with 8x NVIDIA A100 (80 GB) GPUs
- Run ilab data generate (see the command sketch below)
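For reference, a minimal command sketch of the steps above. It assumes a default RHEL AI configuration (ilab config init already run and the default teacher model downloaded); exact flags may differ per environment.

    ilab system info      # confirm the 8x A100 80 GB GPUs are visible
    ilab data generate    # fails with CUDA OOM instead of completing SDG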
Expected behavior
- SDG runs and finishes successfully.
Device Info (please complete the following information):
- Hardware Specs: p4de.24xlarge in AWS (8 of A100 80 GB)
- OS Version: RHEL AI 1.5.z
- InstructLab Version: ilab, version 0.26.1
- Output of the requested commands:
  - Bootc image: registry.stage.redhat.io/rhelai1/bootc-aws-nvidia-rhel9:1.5.2-1750191774
  - ilab system info:
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 8 CUDA devices:
  Device 0: NVIDIA A100-SXM4-80GB, compute capability 8.0, VMM: yes
  Device 1: NVIDIA A100-SXM4-80GB, compute capability 8.0, VMM: yes
  Device 2: NVIDIA A100-SXM4-80GB, compute capability 8.0, VMM: yes
  Device 3: NVIDIA A100-SXM4-80GB, compute capability 8.0, VMM: yes
  Device 4: NVIDIA A100-SXM4-80GB, compute capability 8.0, VMM: yes
  Device 5: NVIDIA A100-SXM4-80GB, compute capability 8.0, VMM: yes
  Device 6: NVIDIA A100-SXM4-80GB, compute capability 8.0, VMM: yes
  Device 7: NVIDIA A100-SXM4-80GB, compute capability 8.0, VMM: yes
Platform:
  sys.version: 3.11.7 (main, Jan 8 2025, 00:00:00) [GCC 11.4.1 20231218 (Red Hat 11.4.1-3)]
  sys.platform: linux
  os.name: posix
  platform.release: 5.14.0-427.65.1.el9_4.x86_64
  platform.machine: x86_64
  platform.node: ip-172-31-41-154.ec2.internal
  platform.python_version: 3.11.7
  os-release.ID: rhel
  os-release.VERSION_ID: 9.4
  os-release.PRETTY_NAME: Red Hat Enterprise Linux 9.4 (Plow)
  memory.total: 1121.81 GB
  memory.available: 1028.60 GB
  memory.used: 30.74 GB
InstructLab:
  instructlab.version: 0.26.1
  instructlab-dolomite.version: 0.2.0
  instructlab-eval.version: 0.5.1
  instructlab-quantize.version: 0.1.0
  instructlab-schema.version: 0.4.2
  instructlab-sdg.version: 0.8.3
  instructlab-training.version: 0.10.3
Torch:
  torch.version: 2.6.0
  torch.backends.cpu.capability: AVX512
  torch.version.cuda: 12.4
  torch.version.hip: None
  torch.cuda.available: True
  torch.backends.cuda.is_built: True
  torch.backends.mps.is_built: False
  torch.backends.mps.is_available: False
  torch.cuda.bf16: True
  torch.cuda.current.device: 0
  torch.cuda.0.name: NVIDIA A100-SXM4-80GB
  torch.cuda.0.free: 6.8 GB
  torch.cuda.0.total: 79.1 GB
  torch.cuda.0.capability: 8.0 (see https://developer.nvidia.com/cuda-gpus#compute)
  torch.cuda.1.name: NVIDIA A100-SXM4-80GB
  torch.cuda.1.free: 7.7 GB
  torch.cuda.1.total: 79.1 GB
  torch.cuda.1.capability: 8.0 (see https://developer.nvidia.com/cuda-gpus#compute)
  torch.cuda.2.name: NVIDIA A100-SXM4-80GB
  torch.cuda.2.free: 7.7 GB
  torch.cuda.2.total: 79.1 GB
  torch.cuda.2.capability: 8.0 (see https://developer.nvidia.com/cuda-gpus#compute)
  torch.cuda.3.name: NVIDIA A100-SXM4-80GB
  torch.cuda.3.free: 7.7 GB
  torch.cuda.3.total: 79.1 GB
  torch.cuda.3.capability: 8.0 (see https://developer.nvidia.com/cuda-gpus#compute)
  torch.cuda.4.name: NVIDIA A100-SXM4-80GB
  torch.cuda.4.free: 7.7 GB
  torch.cuda.4.total: 79.1 GB
  torch.cuda.4.capability: 8.0 (see https://developer.nvidia.com/cuda-gpus#compute)
  torch.cuda.5.name: NVIDIA A100-SXM4-80GB
  torch.cuda.5.free: 7.7 GB
  torch.cuda.5.total: 79.1 GB
  torch.cuda.5.capability: 8.0 (see https://developer.nvidia.com/cuda-gpus#compute)
  torch.cuda.6.name: NVIDIA A100-SXM4-80GB
  torch.cuda.6.free: 7.7 GB
  torch.cuda.6.total: 79.1 GB
  torch.cuda.6.capability: 8.0 (see https://developer.nvidia.com/cuda-gpus#compute)
  torch.cuda.7.name: NVIDIA A100-SXM4-80GB
  torch.cuda.7.free: 8.0 GB
  torch.cuda.7.total: 79.1 GB
  torch.cuda.7.capability: 8.0 (see https://developer.nvidia.com/cuda-gpus#compute)
llama_cpp_python:
  llama_cpp_python.version: 0.3.6
  llama_cpp_python.supports_gpu_offload: True
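Note that the output above reports only 6.8-8.0 GB free out of 79.1 GB on every GPU, so most of the VRAM was already occupied when ilab system info ran. As a diagnostic suggestion (not part of the original report), the following nvidia-smi queries show per-GPU usage and which processes hold that memory:

    nvidia-smi --query-gpu=index,memory.total,memory.used,memory.free --format=csv
    nvidia-smi --query-compute-apps=pid,process_name,used_memory --format=csv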
Bug impact
- SDG fails with a CUDA out-of-memory error, so synthetic data generation cannot complete.
Known workaround
- None identified so far.
Additional context
torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 224.00 MiB. GPU 0 has a total capacity of 79.14 GiB of which 171.19 MiB is free. Including non-PyTorch memory, this process has 0 bytes memory in use. Of the allocated memory 5.80 GiB is allocated by PyTorch, and 20.56 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
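As an untested mitigation taken directly from the error message above, the PyTorch allocator can be switched to expandable segments before rerunning SDG; this only helps if the failure is caused by allocator fragmentation rather than the GPUs genuinely lacking free memory:

    # suggested by the OOM message; reduces fragmentation in the PyTorch CUDA allocator
    export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
    ilab data generate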
Full logs will be attached.