AI Platform Core Components / AIPCC-726

RHEL AI 1.4.3 - vllm fails to start on AMD for SDG - Ready to be tested


      To Reproduce

      Steps to reproduce the behavior:

      1. Deploy RHEL AI 1.4.3 onto Azure
      2. Prepare the system (ilab config init, download models; the exact commands are sketched after this list)
      3. Run ilab data generate
      4. Observe the assert isinstance(module, BaseLayerWithLoRA) traceback from vllm
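
      For reference, a minimal command sketch of steps 2 and 3, assuming the default RHEL AI teacher model and configuration (the configuration actually used on the affected system is in the attached ilab_cfg.md):

        # Step 2: initialize the InstructLab configuration and fetch the models it references
        ilab config init
        ilab model download
        # Step 3: run synthetic data generation; this is where vllm fails to start
        ilab data generate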

      Expected behavior

      • Successful SDG

      Device Info (please complete the following information):

      • Hardware Specs: Standard_ND96asr_v4 (8*MI300X)
      • OS Version: RHEL AI 1.4.3
      • InstructLab Version: 0.23.3
      • Output of sudo bootc status --format json | jq .status.booted.image.image.image:
        "registry.stage.redhat.io/rhelai1/bootc-azure-amd-rhel9:1.4.3-1741712118"
      • Output of ilab system info:

      Platform:
        sys.version: 3.11.7 (main, Jan  8 2025, 00:00:00) [GCC 11.4.1 20231218 (Red Hat 11.4.1-3)]
        sys.platform: linux
        os.name: posix
        platform.release: 5.14.0-427.55.1.el9_4.x86_64
        platform.machine: x86_64
        platform.node: fzatlouk-rhelai-1.3-amd-test-westus
        platform.python_version: 3.11.7
        os-release.ID: rhel
        os-release.VERSION_ID: 9.4
        os-release.PRETTY_NAME: Red Hat Enterprise Linux 9.4 (Plow)
        memory.total: 1820.96 GB
        memory.available: 1784.80 GB
        memory.used: 29.04 GB
      InstructLab:
        instructlab.version: 0.23.3
        instructlab-dolomite.version: 0.2.0
        instructlab-eval.version: 0.5.1
        instructlab-quantize.version: 0.1.0
        instructlab-schema.version: 0.4.2
        instructlab-sdg.version: 0.7.1
        instructlab-training.version: 0.7.0
      Torch:
        torch.version: 2.4.1
        torch.backends.cpu.capability: AVX512
        torch.version.cuda: None
        torch.version.hip: 6.2.41134-65d174c3e
        torch.cuda.available: True
        torch.backends.cuda.is_built: True
        torch.backends.mps.is_built: False
        torch.backends.mps.is_available: False
        torch.cuda.bf16: True
        torch.cuda.current.device: 0
        torch.cuda.0.name: AMD Radeon Graphics
        torch.cuda.0.free: 191.0 GB
        torch.cuda.0.total: 191.5 GB
        torch.cuda.0.capability: 9.4 (see https://developer.nvidia.com/cuda-gpus#compute)
        torch.cuda.1.name: AMD Radeon Graphics
        torch.cuda.1.free: 191.0 GB
        torch.cuda.1.total: 191.5 GB
        torch.cuda.1.capability: 9.4 (see https://developer.nvidia.com/cuda-gpus#compute)
        torch.cuda.2.name: AMD Radeon Graphics
        torch.cuda.2.free: 191.0 GB
        torch.cuda.2.total: 191.5 GB
        torch.cuda.2.capability: 9.4 (see https://developer.nvidia.com/cuda-gpus#compute)
        torch.cuda.3.name: AMD Radeon Graphics
        torch.cuda.3.free: 191.0 GB
        torch.cuda.3.total: 191.5 GB
        torch.cuda.3.capability: 9.4 (see https://developer.nvidia.com/cuda-gpus#compute)
        torch.cuda.4.name: AMD Radeon Graphics
        torch.cuda.4.free: 191.0 GB
        torch.cuda.4.total: 191.5 GB
        torch.cuda.4.capability: 9.4 (see https://developer.nvidia.com/cuda-gpus#compute)
        torch.cuda.5.name: AMD Radeon Graphics
        torch.cuda.5.free: 191.0 GB
        torch.cuda.5.total: 191.5 GB
        torch.cuda.5.capability: 9.4 (see https://developer.nvidia.com/cuda-gpus#compute)
        torch.cuda.6.name: AMD Radeon Graphics
        torch.cuda.6.free: 191.0 GB
        torch.cuda.6.total: 191.5 GB
        torch.cuda.6.capability: 9.4 (see https://developer.nvidia.com/cuda-gpus#compute)
        torch.cuda.7.name: AMD Radeon Graphics
        torch.cuda.7.free: 191.0 GB
        torch.cuda.7.total: 191.5 GB
        torch.cuda.7.capability: 9.4 (see https://developer.nvidia.com/cuda-gpus#compute)
      llama_cpp_python:
        llama_cpp_python.version: 0.3.2
        llama_cpp_python.supports_gpu_offload: False

      Bug impact

      • SDG can't be run

      Known workaround

      • N/A

      Additional context

      ilab model serve and ilab model chat work just fine.

      The initial failure seems to be (full log attached):

      ERROR 03-12 17:29:29 engine.py:366] 
      Traceback (most recent call last):
        File "/opt/app-root/lib64/python3.11/site-packages/vllm/engine/multiprocessing/engine.py", line 357, in run_mp_engine
          engine = MQLLMEngine.from_engine_args(engine_args=engine_args,
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
        File "/opt/app-root/lib64/python3.11/site-packages/vllm/engine/multiprocessing/engine.py", line 119, in from_engine_args
          return cls(ipc_path=ipc_path,
                 ^^^^^^^^^^^^^^^^^^^^^^
        File "/opt/app-root/lib64/python3.11/site-packages/vllm/engine/multiprocessing/engine.py", line 71, in __init__
          self.engine = LLMEngine(*args, **kwargs)
                        ^^^^^^^^^^^^^^^^^^^^^^^^^^
        File "/opt/app-root/lib64/python3.11/site-packages/vllm/engine/llm_engine.py", line 288, in __init__
          self.model_executor = executor_class(vllm_config=vllm_config, )
                                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
        File "/opt/app-root/lib64/python3.11/site-packages/vllm/executor/distributed_gpu_executor.py", line 26, in __init__
          super().__init__(*args, **kwargs)
        File "/opt/app-root/lib64/python3.11/site-packages/vllm/executor/executor_base.py", line 36, in __init__
          self._init_executor()
        File "/opt/app-root/lib64/python3.11/site-packages/vllm/executor/multiproc_gpu_executor.py", line 83, in _init_executor
          self._run_workers("load_model",
        File "/opt/app-root/lib64/python3.11/site-packages/vllm/executor/multiproc_gpu_executor.py", line 157, in _run_workers
          driver_worker_output = driver_worker_method(*args, **kwargs)
                                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
        File "/opt/app-root/lib64/python3.11/site-packages/vllm/worker/worker.py", line 183, in load_model
          self.model_runner.load_model()
        File "/opt/app-root/lib64/python3.11/site-packages/vllm/worker/model_runner.py", line 1124, in load_model
          self.model = self.lora_manager.create_lora_manager(self.model)
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
        File "/opt/app-root/lib64/python3.11/site-packages/vllm/lora/worker_manager.py", line 174, in create_lora_manager
          lora_manager = create_lora_manager(
                         ^^^^^^^^^^^^^^^^^^^^
        File "/opt/app-root/lib64/python3.11/site-packages/vllm/lora/models.py", line 755, in create_lora_manager
          lora_manager = lora_manager_cls(
                         ^^^^^^^^^^^^^^^^^
        File "/opt/app-root/lib64/python3.11/site-packages/vllm/lora/models.py", line 678, in __init__
          super().__init__(model, max_num_seqs, max_num_batched_tokens,
        File "/opt/app-root/lib64/python3.11/site-packages/vllm/lora/models.py", line 353, in __init__
          self._create_lora_modules()
        File "/opt/app-root/lib64/python3.11/site-packages/vllm/lora/models.py", line 507, in _create_lora_modules
          self.register_module(module_name, new_module)
        File "/opt/app-root/lib64/python3.11/site-packages/vllm/lora/models.py", line 513, in register_module
          assert isinstance(module, BaseLayerWithLoRA)
      AssertionError
      Process SpawnProcess-1
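
      The failing check is the register_module assertion in vllm/lora/models.py shown at the end of the traceback. That code path is only reached when the engine is started with LoRA support enabled (note create_lora_manager in the call chain), which is presumably why plain ilab model serve and ilab model chat work while SDG does not. A minimal Python sketch of what the assertion expects is below; the classes and registry here are simplified stand-ins, not vllm's actual implementation:

        # Illustrative stand-ins only: BaseLayerWithLoRA mimics vllm.lora.layers.BaseLayerWithLoRA,
        # and PlainLinear represents a layer that was never wrapped for LoRA.
        class BaseLayerWithLoRA:
            """A layer that has been wrapped with LoRA support."""

        class PlainLinear:
            """A layer the LoRA wrapping step skipped or does not support."""

        modules = {}

        def register_module(module_name: str, module) -> None:
            # Mirrors the assertion at vllm/lora/models.py line 513 in the traceback:
            # every module handed to the LoRA manager must already be LoRA-aware,
            # otherwise engine startup aborts with AssertionError.
            assert isinstance(module, BaseLayerWithLoRA)
            modules[module_name] = module

        register_module("wrapped_proj", BaseLayerWithLoRA())  # succeeds
        register_module("unwrapped_proj", PlainLinear())      # AssertionError, as seen in the log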

       

      Attachments:
        1. ilab_data_generate_amd.log (45 kB, František Zatloukal)
        2. ilab_cfg.md (20 kB, František Zatloukal)
        3. sdg_abort.log (503 kB, František Zatloukal)

      Assignee: Joseph Groenenboom (rh-ee-jgroenen)
      Reporter: František Zatloukal (fzatlouk@redhat.com)