Red Hat Enterprise Linux AI / RHELAI-4184

`ilab data generate` only uses 1 GPU on AMD (8x MI300X)


      To Reproduce

      Steps to reproduce the behavior:

      1. Spin up Azure Standard_ND96is_MI300X_v5 VM
      2. Download the required granite, mixtral, prometheus, skill/knowledge-adapter models
      3. Run `ilab data generate` and observe the per-GPU load (see the sketch after this list)
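
      A minimal reproduction sketch of step 3, assuming the models from step 2 are already downloaded and that `rocm-smi` from the ROCm stack is available on the image (both assumptions, not taken from this report):

        # Start SDG in the background, then watch per-GPU utilization.
        ilab data generate &
        # Expected: load on all 8 MI300X GPUs; observed: only GPU 0 shows activity.
        watch -n 2 rocm-smi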

      Expected behavior

      • All available GPUs are used (see the check sketched below)
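
      A quick sanity check (a sketch, not part of the original report) to rule out device masking: confirm that no visibility environment variable hides GPUs from the SDG process and that torch sees all 8 devices, matching the `ilab system info` output below.

        # Any of these variables, if set, can restrict which GPUs the process sees.
        env | grep -E 'HIP_VISIBLE_DEVICES|CUDA_VISIBLE_DEVICES|ROCR_VISIBLE_DEVICES'
        # Should print 8 on this VM, matching the torch.cuda.* entries below.
        python3 -c 'import torch; print(torch.cuda.device_count())'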

      Device Info:

      • Hardware Specs: Azure Standard_ND96is_MI300X_v5 VM
      • InstructLab Version: 0.26.1

      `sudo bootc status ...`:

      registry.redhat.io/rhelai1/bootc-azure-amd-rhel9:1.5

       

      `ilab system info`:

      Platform:
        sys.version: 3.11.7 (main, Jan  8 2025, 00:00:00) [GCC 11.4.1 20231218 (Red Hat 11.4.1-3)]
        sys.platform: linux
        os.name: posix
        platform.release: 5.14.0-427.65.1.el9_4.x86_64
        platform.machine: x86_64
        platform.node: fzatlouk-rhelai-1.5-amd-test-westus
        platform.python_version: 3.11.7
        os-release.ID: rhel
        os-release.VERSION_ID: 9.4
        os-release.PRETTY_NAME: Red Hat Enterprise Linux 9.4 (Plow)
        memory.total: 1820.96 GB
        memory.available: 1784.04 GB
        memory.used: 29.10 GB

      InstructLab:
        instructlab.version: 0.26.1
        instructlab-dolomite.version: 0.2.0
        instructlab-eval.version: 0.5.1
        instructlab-quantize.version: 0.1.0
        instructlab-schema.version: 0.4.2
        instructlab-sdg.version: 0.8.2
        instructlab-training.version: 0.10.2

      Torch:
        torch.version: 2.6.0
        torch.backends.cpu.capability: AVX512
        torch.version.cuda: None
        torch.version.hip: 6.3.42134-a9a80e791
        torch.cuda.available: True
        torch.backends.cuda.is_built: True
        torch.backends.mps.is_built: False
        torch.backends.mps.is_available: False
        torch.cuda.bf16: True
        torch.cuda.current.device: 0
        torch.cuda.0.name: AMD Radeon Graphics
        torch.cuda.0.free: 191.0 GB
        torch.cuda.0.total: 191.5 GB
        torch.cuda.0.capability: 9.4 (see https://developer.nvidia.com/cuda-gpus#compute)
        torch.cuda.1.name: AMD Radeon Graphics
        torch.cuda.1.free: 191.0 GB
        torch.cuda.1.total: 191.5 GB
        torch.cuda.1.capability: 9.4 (see https://developer.nvidia.com/cuda-gpus#compute)
        torch.cuda.2.name: AMD Radeon Graphics
        torch.cuda.2.free: 191.0 GB
        torch.cuda.2.total: 191.5 GB
        torch.cuda.2.capability: 9.4 (see https://developer.nvidia.com/cuda-gpus#compute)
        torch.cuda.3.name: AMD Radeon Graphics
        torch.cuda.3.free: 191.0 GB
        torch.cuda.3.total: 191.5 GB
        torch.cuda.3.capability: 9.4 (see https://developer.nvidia.com/cuda-gpus#compute)
        torch.cuda.4.name: AMD Radeon Graphics
        torch.cuda.4.free: 191.0 GB
        torch.cuda.4.total: 191.5 GB
        torch.cuda.4.capability: 9.4 (see https://developer.nvidia.com/cuda-gpus#compute)
        torch.cuda.5.name: AMD Radeon Graphics
        torch.cuda.5.free: 191.0 GB
        torch.cuda.5.total: 191.5 GB
        torch.cuda.5.capability: 9.4 (see https://developer.nvidia.com/cuda-gpus#compute)
        torch.cuda.6.name: AMD Radeon Graphics
        torch.cuda.6.free: 191.0 GB
        torch.cuda.6.total: 191.5 GB
        torch.cuda.6.capability: 9.4 (see https://developer.nvidia.com/cuda-gpus#compute)
        torch.cuda.7.name: AMD Radeon Graphics
        torch.cuda.7.free: 191.0 GB
        torch.cuda.7.total: 191.5 GB
        torch.cuda.7.capability: 9.4 (see https://developer.nvidia.com/cuda-gpus#compute)

      llama_cpp_python:
        llama_cpp_python.version: 0.3.6
        llama_cpp_python.supports_gpu_offload: False

      Bug impact

      • SDG is unnecessarily slow; a possible mitigation is sketched below.
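
      A possible mitigation while the defect is investigated (a sketch; it assumes the `--gpus` option of `ilab data generate` is available in InstructLab 0.26.1 and that the teacher model for SDG is served through vLLM, neither of which is confirmed in this report):

        # Inspect how many GPUs the generate phase is currently configured to use.
        ilab config show | grep -A 20 'generate:'
        # Explicitly request all 8 GPUs for SDG (the report suggests only one is used by default).
        ilab data generate --gpus 8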
