Red Hat Enterprise Linux AI / RHELAI-3879

ilab train AssertionError: MultiQueryAttention should have 1 head for keys and values


      To Reproduce

      Steps to reproduce the behavior:

      • Run a short training job with phased_num_epochs reduced to 2.

      Actual behavior

      • ilab train fails with AssertionError: MultiQueryAttention should have 1 head for keys and values
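
      The wording of the assertion suggests a consistency check between the attention type and the number of key/value heads in the model configuration. The report does not include a traceback, so the following is only a minimal sketch of that kind of check, not the code that raised the error. The names AttentionConfig, attention_head_type, num_key_value_heads, and check_attention_heads are hypothetical and chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class AttentionConfig:
    # Hypothetical config fields, named for illustration only.
    num_attention_heads: int
    num_key_value_heads: int
    attention_head_type: str  # "mha", "mqa", or "gqa"


def check_attention_heads(cfg: AttentionConfig) -> None:
    """Raise AssertionError if the head counts do not match the attention type."""
    if cfg.attention_head_type == "mha":
        # Multi-head attention: one key/value head per query head.
        assert cfg.num_key_value_heads == cfg.num_attention_heads, (
            "MultiHeadAttention should have the same number of heads "
            "for queries, keys and values"
        )
    elif cfg.attention_head_type == "mqa":
        # Multi-query attention: all query heads share a single key/value head.
        assert cfg.num_key_value_heads == 1, (
            "MultiQueryAttention should have 1 head for keys and values"
        )
    elif cfg.attention_head_type == "gqa":
        # Grouped-query attention: query heads are split evenly across key/value heads.
        assert cfg.num_attention_heads % cfg.num_key_value_heads == 0, (
            "GroupedQueryAttention needs the query head count to be "
            "divisible by the key/value head count"
        )
    else:
        raise ValueError(f"unknown attention_head_type: {cfg.attention_head_type}")


# A config labelled "mqa" that carries more than one key/value head trips the check.
try:
    check_attention_heads(AttentionConfig(32, 8, "mqa"))
except AssertionError as err:
    print(f"AssertionError: {err}")
```

      If the real check has this shape, comparing the attention type and key/value head count in the config of the model or checkpoint being trained would be a reasonable first triage step.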

      Device Info (please complete the following information):

      • Hardware Specs: IBM Cloud instance (8 H100 GPUs)
      • OS Version: RHEL AI 1.4
      • InstructLab Version: 0.23.5
      • Provide the output of these two commands:
        • ilab system info

        platform.node: rhelai-qe-h100-br-sao

        platform.python_version: 3.11.7

        os-release.ID: rhel

        os-release.VERSION_ID: 9.4

        os-release.PRETTY_NAME: Red Hat Enterprise Linux 9.4 (Plow)

        memory.total: 1763.83 GB

        memory.available: 1752.49 GB

        memory.used: 3.41 GB

       

      InstructLab:

        instructlab.version: 0.23.5

        instructlab-dolomite.version: 0.2.0

        instructlab-eval.version: 0.5.1

        instructlab-quantize.version: 0.1.0

        instructlab-schema.version: 0.4.2

        instructlab-sdg.version: 0.7.2

        instructlab-training.version: 0.7.0

       

      Torch:

        torch.version: 2.5.1

        torch.backends.cpu.capability: AVX512

        torch.version.cuda: 12.4

        torch.version.hip: None

        torch.cuda.available: True

        torch.backends.cuda.is_built: True

        torch.backends.mps.is_built: False

        torch.backends.mps.is_available: False

        torch.cuda.bf16: True

        torch.cuda.current.device: 0

        torch.cuda.0.name: NVIDIA H100 80GB HBM3

        torch.cuda.0.free: 78.6 GB

        torch.cuda.0.total: 79.1 GB

        torch.cuda.0.capability: 9.0 (see https://developer.nvidia.com/cuda-gpus#compute)

        torch.cuda.1.name: NVIDIA H100 80GB HBM3

        torch.cuda.1.free: 78.6 GB

        torch.cuda.1.total: 79.1 GB

        torch.cuda.1.capability: 9.0 (see https://developer.nvidia.com/cuda-gpus#compute)

        torch.cuda.2.name: NVIDIA H100 80GB HBM3

        torch.cuda.2.free: 78.6 GB

        torch.cuda.2.total: 79.1 GB

        torch.cuda.2.capability: 9.0 (see https://developer.nvidia.com/cuda-gpus#compute)

        torch.cuda.3.name: NVIDIA H100 80GB HBM3

        torch.cuda.3.free: 78.6 GB

        torch.cuda.3.total: 79.1 GB

        torch.cuda.3.capability: 9.0 (see https://developer.nvidia.com/cuda-gpus#compute)

        torch.cuda.4.name: NVIDIA H100 80GB HBM3

        torch.cuda.4.free: 78.6 GB

        torch.cuda.4.total: 79.1 GB

        torch.cuda.4.capability: 9.0 (see https://developer.nvidia.com/cuda-gpus#compute)

        torch.cuda.5.name: NVIDIA H100 80GB HBM3

        torch.cuda.5.free: 78.6 GB

        torch.cuda.5.total: 79.1 GB

        torch.cuda.5.capability: 9.0 (see https://developer.nvidia.com/cuda-gpus#compute)

        torch.cuda.6.name: NVIDIA H100 80GB HBM3

        torch.cuda.6.free: 78.6 GB

        torch.cuda.6.total: 79.1 GB

        torch.cuda.6.capability: 9.0 (see https://developer.nvidia.com/cuda-gpus#compute)

        torch.cuda.7.name: NVIDIA H100 80GB HBM3

        torch.cuda.7.free: 78.6 GB

        torch.cuda.7.total: 79.1 GB

        torch.cuda.7.capability: 9.0 (see https://developer.nvidia.com/cuda-gpus#compute)

       

      llama_cpp_python:

        llama_cpp_python.version: 0.3.2

        llama_cpp_python.supports_gpu_offload: True
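
      For triage, a dump like the one above can be turned into a dictionary to confirm the GPU count and library versions quickly. This is a minimal sketch that assumes the "key: value" layout shown in this report; parse_system_info and gpu_names are hypothetical helpers, not part of InstructLab.

```python
import re


def parse_system_info(text: str) -> dict[str, str]:
    """Collect 'key: value' lines; section headers such as 'Torch:' are skipped."""
    info = {}
    for raw in text.splitlines():
        line = raw.strip().lstrip("•").strip()
        key, sep, value = line.partition(": ")
        # Keep only dotted keys without spaces; headers have no ': ' separator.
        if sep and key and " " not in key:
            info[key] = value.strip()
    return info


def gpu_names(info: dict[str, str]) -> dict[int, str]:
    """Map GPU index to device name using the torch.cuda.<n>.name keys."""
    pattern = re.compile(r"torch\.cuda\.(\d+)\.name$")
    names = {}
    for key, value in info.items():
        match = pattern.match(key)
        if match:
            names[int(match.group(1))] = value
    return names


# A short excerpt in the same layout as the report.
sample = """
platform.python_version: 3.11.7
Torch:
  torch.version: 2.5.1
  torch.cuda.0.name: NVIDIA H100 80GB HBM3
  torch.cuda.1.name: NVIDIA H100 80GB HBM3
"""
info = parse_system_info(sample)
gpus = gpu_names(info)
print(len(gpus), "GPUs, torch", info["torch.version"])
```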


              rhn-support-jkunstle James Kunstle (Inactive)
              rh-ee-jlarkin Justin Larkin
              Kamesh Akella Kamesh Akella