

      To Reproduce

      Steps to reproduce the behavior:

      1. Deploy RHEL AI 1.5.2 on a system with 8x NVIDIA A100 (80 GB) GPUs.
      2. Run ilab data generate (see the pre-check sketch below).
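
      Before launching generation, it can help to confirm how much GPU memory is already in use on the node. A minimal pre-check sketch using standard nvidia-smi query flags (not part of the original report):

      # Show per-GPU used/free memory before launching SDG
      nvidia-smi --query-gpu=index,memory.used,memory.free --format=csv
      # Then start generation
      ilab data generate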

      Expected behavior

      • SDG runs and finishes successfully.

      Device Info

      • Hardware Specs: AWS p4de.24xlarge (8x NVIDIA A100 SXM4 80 GB)
      • OS Version: RHEL AI 1.5.z
      • InstructLab Version: ilab, version 0.26.1
      • Booted bootc image: registry.stage.redhat.io/rhelai1/bootc-aws-nvidia-rhel9:1.5.2-1750191774
      • Output of ilab system info:

       

      ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
      ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
      ggml_cuda_init: found 8 CUDA devices:
        Device 0: NVIDIA A100-SXM4-80GB, compute capability 8.0, VMM: yes
        Device 1: NVIDIA A100-SXM4-80GB, compute capability 8.0, VMM: yes
        Device 2: NVIDIA A100-SXM4-80GB, compute capability 8.0, VMM: yes
        Device 3: NVIDIA A100-SXM4-80GB, compute capability 8.0, VMM: yes
        Device 4: NVIDIA A100-SXM4-80GB, compute capability 8.0, VMM: yes
        Device 5: NVIDIA A100-SXM4-80GB, compute capability 8.0, VMM: yes
        Device 6: NVIDIA A100-SXM4-80GB, compute capability 8.0, VMM: yes
        Device 7: NVIDIA A100-SXM4-80GB, compute capability 8.0, VMM: yes
      Platform:
        sys.version: 3.11.7 (main, Jan  8 2025, 00:00:00) [GCC 11.4.1 20231218 (Red Hat 11.4.1-3)]
        sys.platform: linux
        os.name: posix
        platform.release: 5.14.0-427.65.1.el9_4.x86_64
        platform.machine: x86_64
        platform.node: ip-172-31-41-154.ec2.internal
        platform.python_version: 3.11.7
        os-release.ID: rhel
        os-release.VERSION_ID: 9.4
        os-release.PRETTY_NAME: Red Hat Enterprise Linux 9.4 (Plow)
        memory.total: 1121.81 GB
        memory.available: 1028.60 GB
        memory.used: 30.74 GB
      InstructLab:
        instructlab.version: 0.26.1
        instructlab-dolomite.version: 0.2.0
        instructlab-eval.version: 0.5.1
        instructlab-quantize.version: 0.1.0
        instructlab-schema.version: 0.4.2
        instructlab-sdg.version: 0.8.3
        instructlab-training.version: 0.10.3
      Torch:
        torch.version: 2.6.0
        torch.backends.cpu.capability: AVX512
        torch.version.cuda: 12.4
        torch.version.hip: None
        torch.cuda.available: True
        torch.backends.cuda.is_built: True
        torch.backends.mps.is_built: False
        torch.backends.mps.is_available: False
        torch.cuda.bf16: True
        torch.cuda.current.device: 0
        torch.cuda.0.name: NVIDIA A100-SXM4-80GB
        torch.cuda.0.free: 6.8 GB
        torch.cuda.0.total: 79.1 GB
        torch.cuda.0.capability: 8.0 (see https://developer.nvidia.com/cuda-gpus#compute)
        torch.cuda.1.name: NVIDIA A100-SXM4-80GB
        torch.cuda.1.free: 7.7 GB
        torch.cuda.1.total: 79.1 GB
        torch.cuda.1.capability: 8.0 (see https://developer.nvidia.com/cuda-gpus#compute)
        torch.cuda.2.name: NVIDIA A100-SXM4-80GB
        torch.cuda.2.free: 7.7 GB
        torch.cuda.2.total: 79.1 GB
        torch.cuda.2.capability: 8.0 (see https://developer.nvidia.com/cuda-gpus#compute)
        torch.cuda.3.name: NVIDIA A100-SXM4-80GB
        torch.cuda.3.free: 7.7 GB
        torch.cuda.3.total: 79.1 GB
        torch.cuda.3.capability: 8.0 (see https://developer.nvidia.com/cuda-gpus#compute)
        torch.cuda.4.name: NVIDIA A100-SXM4-80GB
        torch.cuda.4.free: 7.7 GB
        torch.cuda.4.total: 79.1 GB
        torch.cuda.4.capability: 8.0 (see https://developer.nvidia.com/cuda-gpus#compute)
        torch.cuda.5.name: NVIDIA A100-SXM4-80GB
        torch.cuda.5.free: 7.7 GB
        torch.cuda.5.total: 79.1 GB
        torch.cuda.5.capability: 8.0 (see https://developer.nvidia.com/cuda-gpus#compute)
        torch.cuda.6.name: NVIDIA A100-SXM4-80GB
        torch.cuda.6.free: 7.7 GB
        torch.cuda.6.total: 79.1 GB
        torch.cuda.6.capability: 8.0 (see https://developer.nvidia.com/cuda-gpus#compute)
        torch.cuda.7.name: NVIDIA A100-SXM4-80GB
        torch.cuda.7.free: 8.0 GB
        torch.cuda.7.total: 79.1 GB
        torch.cuda.7.capability: 8.0 (see https://developer.nvidia.com/cuda-gpus#compute)
      llama_cpp_python:
        llama_cpp_python.version: 0.3.6
        llama_cpp_python.supports_gpu_offload: True
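
      Note from the output above: every GPU reports only about 7 to 8 GB free of 79.1 GB total, so most GPU memory was already allocated when this information was gathered. A minimal sketch to reproduce those per-GPU free/total figures, assuming the image's python3 with torch is on PATH (not part of the original report):

      # Prints (free GiB, total GiB) per GPU, matching the torch.cuda.N.free/total lines above
      python3 -c 'import torch; print([tuple(round(x / 2**30, 1) for x in torch.cuda.mem_get_info(i)) for i in range(torch.cuda.device_count())])'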
      

       

      Bug impact

      • SDG fails with a CUDA out-of-memory error, so synthetic data generation cannot complete.

      Known workaround

      • None recorded at the time of filing; the PyTorch error message suggests an allocator setting (see the sketch under Additional context).

      Additional context

      torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 224.00 MiB. GPU 0 has a total capacity of 79.14 GiB of which 171.19 MiB is free. Including non-PyTorch memory, this process has 0 bytes memory in use. Of the allocated memory 5.80 GiB is allocated by PyTorch, and 20.56 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
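
      The error message itself suggests one possible mitigation. A hedged, untested sketch based only on that recommendation (not a confirmed fix for this bug):

      # Untested: expandable segments can reduce allocator fragmentation,
      # per the PyTorch error message above
      export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
      ilab data generate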

       

      Full logs are attached.

      Attachment: a100_80_sdg.log (80 kB)