- Bug
- Resolution: Cannot Reproduce
- Critical
- rhelai-1.5
To Reproduce
Steps to reproduce the behavior:
- Deploy RHEL AI 1.5.2 on a system with 8x NVIDIA A100 (80 GB) GPUs
- Run ilab data generate (see the command sketch below)
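For reference, a minimal command sketch of the steps above. It assumes a default RHEL AI configuration (ilab config init already run and the default teacher model downloaded); exact flags may differ per environment.

    ilab system info      # confirm the 8x A100 80 GB GPUs are visible
    ilab data generate    # fails with CUDA OOM instead of completing SDG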
Expected behavior
- SDG runs and finishes successfully.
Device Info (please complete the following information):
- Hardware Specs: p4de.24xlarge in AWS (8 of A100 80 GB)
- OS Version: RHEL AI 1.5.z
- InstructLab Version: ilab, version 0.26.1
- Output of the requested commands:
  - Bootc image: registry.stage.redhat.io/rhelai1/bootc-aws-nvidia-rhel9:1.5.2-1750191774
  - ilab system info:
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 8 CUDA devices:
  Device 0: NVIDIA A100-SXM4-80GB, compute capability 8.0, VMM: yes
  Device 1: NVIDIA A100-SXM4-80GB, compute capability 8.0, VMM: yes
  Device 2: NVIDIA A100-SXM4-80GB, compute capability 8.0, VMM: yes
  Device 3: NVIDIA A100-SXM4-80GB, compute capability 8.0, VMM: yes
  Device 4: NVIDIA A100-SXM4-80GB, compute capability 8.0, VMM: yes
  Device 5: NVIDIA A100-SXM4-80GB, compute capability 8.0, VMM: yes
  Device 6: NVIDIA A100-SXM4-80GB, compute capability 8.0, VMM: yes
  Device 7: NVIDIA A100-SXM4-80GB, compute capability 8.0, VMM: yes
Platform:
  sys.version: 3.11.7 (main, Jan 8 2025, 00:00:00) [GCC 11.4.1 20231218 (Red Hat 11.4.1-3)]
  sys.platform: linux
  os.name: posix
  platform.release: 5.14.0-427.65.1.el9_4.x86_64
  platform.machine: x86_64
  platform.node: ip-172-31-41-154.ec2.internal
  platform.python_version: 3.11.7
  os-release.ID: rhel
  os-release.VERSION_ID: 9.4
  os-release.PRETTY_NAME: Red Hat Enterprise Linux 9.4 (Plow)
  memory.total: 1121.81 GB
  memory.available: 1028.60 GB
  memory.used: 30.74 GB
InstructLab:
  instructlab.version: 0.26.1
  instructlab-dolomite.version: 0.2.0
  instructlab-eval.version: 0.5.1
  instructlab-quantize.version: 0.1.0
  instructlab-schema.version: 0.4.2
  instructlab-sdg.version: 0.8.3
  instructlab-training.version: 0.10.3
Torch:
  torch.version: 2.6.0
  torch.backends.cpu.capability: AVX512
  torch.version.cuda: 12.4
  torch.version.hip: None
  torch.cuda.available: True
  torch.backends.cuda.is_built: True
  torch.backends.mps.is_built: False
  torch.backends.mps.is_available: False
  torch.cuda.bf16: True
  torch.cuda.current.device: 0
  torch.cuda.0.name: NVIDIA A100-SXM4-80GB
  torch.cuda.0.free: 6.8 GB
  torch.cuda.0.total: 79.1 GB
  torch.cuda.0.capability: 8.0 (see https://developer.nvidia.com/cuda-gpus#compute)
  torch.cuda.1.name: NVIDIA A100-SXM4-80GB
  torch.cuda.1.free: 7.7 GB
  torch.cuda.1.total: 79.1 GB
  torch.cuda.1.capability: 8.0 (see https://developer.nvidia.com/cuda-gpus#compute)
  torch.cuda.2.name: NVIDIA A100-SXM4-80GB
  torch.cuda.2.free: 7.7 GB
  torch.cuda.2.total: 79.1 GB
  torch.cuda.2.capability: 8.0 (see https://developer.nvidia.com/cuda-gpus#compute)
  torch.cuda.3.name: NVIDIA A100-SXM4-80GB
  torch.cuda.3.free: 7.7 GB
  torch.cuda.3.total: 79.1 GB
  torch.cuda.3.capability: 8.0 (see https://developer.nvidia.com/cuda-gpus#compute)
  torch.cuda.4.name: NVIDIA A100-SXM4-80GB
  torch.cuda.4.free: 7.7 GB
  torch.cuda.4.total: 79.1 GB
  torch.cuda.4.capability: 8.0 (see https://developer.nvidia.com/cuda-gpus#compute)
  torch.cuda.5.name: NVIDIA A100-SXM4-80GB
  torch.cuda.5.free: 7.7 GB
  torch.cuda.5.total: 79.1 GB
  torch.cuda.5.capability: 8.0 (see https://developer.nvidia.com/cuda-gpus#compute)
  torch.cuda.6.name: NVIDIA A100-SXM4-80GB
  torch.cuda.6.free: 7.7 GB
  torch.cuda.6.total: 79.1 GB
  torch.cuda.6.capability: 8.0 (see https://developer.nvidia.com/cuda-gpus#compute)
  torch.cuda.7.name: NVIDIA A100-SXM4-80GB
  torch.cuda.7.free: 8.0 GB
  torch.cuda.7.total: 79.1 GB
  torch.cuda.7.capability: 8.0 (see https://developer.nvidia.com/cuda-gpus#compute)
llama_cpp_python:
  llama_cpp_python.version: 0.3.6
  llama_cpp_python.supports_gpu_offload: True
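Note that the output above reports only 6.8-8.0 GB free out of 79.1 GB on every GPU, so most of the VRAM was already occupied when ilab system info ran. As a diagnostic suggestion (not part of the original report), the following nvidia-smi queries show per-GPU usage and which processes hold that memory:

    nvidia-smi --query-gpu=index,memory.total,memory.used,memory.free --format=csv
    nvidia-smi --query-compute-apps=pid,process_name,used_memory --format=csv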
Bug impact
- SDG fails with a CUDA out-of-memory error, so synthetic data generation cannot complete.
Known workaround
- None identified so far.
Additional context
torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 224.00 MiB. GPU 0 has a total capacity of 79.14 GiB of which 171.19 MiB is free. Including non-PyTorch memory, this process has 0 bytes memory in use. Of the allocated memory 5.80 GiB is allocated by PyTorch, and 20.56 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
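As an untested mitigation taken directly from the error message above, the PyTorch allocator can be switched to expandable segments before rerunning SDG; this only helps if the failure is caused by allocator fragmentation rather than the GPUs genuinely lacking free memory:

    # suggested by the OOM message; reduces fragmentation in the PyTorch CUDA allocator
    export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
    ilab data generate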
Full logs will be attached.