Project: Red Hat Enterprise Linux AI
Issue: RHELAI-3659

Failing to start vLLM with Mixtral on AMD

    • Type: Bug
    • Resolution: Duplicate
    • Priority: Major
    • Version(s): rhelai-1.4.4, rhelai-1.4.3
    • Component: InstructLab - SDG

      To Reproduce

      Steps to reproduce the behavior:

      1. Initialize InstructLab with the MI300X x8 profile
      2. Run `ilab data generate` on the AMD MI300X x8 system (a command sketch follows below)
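
      A minimal sketch of the commands involved, in the same console style as the log below. The `ilab config init` step and the interactive profile selection are assumptions based on the usual setup flow; the generate invocation is taken verbatim from the log:

      [cloud-user@lab-mi300x cloud-user]$ ilab config init          # select the MI300X x8 system profile when prompted
      [cloud-user@lab-mi300x cloud-user]$ ilab data generate --enable-serving-output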

      Example output:

      [cloud-user@lab-mi300x cloud-user]$ ilab data generate --enable-serving-output
      INFO 2025-03-14 02:36:50,453 instructlab.process.process:241: Started subprocess with PID 1. Logs are being written to /mnt/instructlab/.local/share/instructlab/logs/generation/generation-2ecc2dd6-007d-11f0-a6ce-6045bd01a29c.log.
      INFO 2025-03-14 02:36:51,010 instructlab.model.backends.vllm:115: Trying to connect to model server at http://127.0.0.1:8000/v1
      INFO 2025-03-14 02:36:52,248 instructlab.model.backends.vllm:332: vLLM starting up on pid 5 at http://127.0.0.1:55565/v1
      INFO 2025-03-14 02:36:52,248 instructlab.model.backends.vllm:123: Starting a temporary vLLM server at http://127.0.0.1:55565/v1
      INFO 2025-03-14 02:36:52,248 instructlab.model.backends.vllm:138: Waiting for the vLLM server to start at http://127.0.0.1:55565/v1, this might take a moment... Attempt: 1/120
      INFO 2025-03-14 02:36:55,545 instructlab.model.backends.vllm:138: Waiting for the vLLM server to start at http://127.0.0.1:55565/v1, this might take a moment... Attempt: 2/120
      WARNING 03-14 02:36:57 rocm.py:34] `fork` method is not supported by ROCm. VLLM_WORKER_MULTIPROC_METHOD is overridden to `spawn` instead.
      /opt/app-root/lib64/python3.11/site-packages/vllm/connections.py:8: RuntimeWarning: Failed to read commit hash:
      No module named 'vllm._version'
        from vllm.version import __version__ as VLLM_VERSION
      INFO 03-14 02:36:58 api_server.py:643] vLLM API server version 0.6.4.post1
      INFO 03-14 02:36:58 api_server.py:644] args: Namespace(host='127.0.0.1', port=55565, uvicorn_log_level='info', allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, lora_modules=[LoRAModulePath(name='skill-classifier-v3-clm', path='/mnt/instructlab/.cache/instructlab/models/skills-adapter-v3', base_model_name=None), LoRAModulePath(name='text-classifier-knowledge-v3-clm', path='/mnt/instructlab/.cache/instructlab/models/knowledge-adapter-v3', base_model_name=None)], prompt_adapters=None, chat_template=None, chat_template_content_format='auto', response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, ssl_cert_reqs=0, root_path=None, middleware=[], return_tokens_as_token_ids=False, disable_frontend_multiprocessing=False, enable_auto_tool_choice=False, tool_call_parser=None, tool_parser_plugin='', model='/mnt/instructlab/.cache/instructlab/models/mixtral-8x7b-instruct-v0-1', task='auto', tokenizer=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=False, allowed_local_media_path=None, download_dir=None, load_format='auto', config_format=<ConfigFormat.AUTO: 'auto'>, dtype='bfloat16', kv_cache_dtype='auto', max_model_len=None, guided_decoding_backend='xgrammar', logits_processor_pattern=None, distributed_executor_backend='mp', worker_use_ray=False, pipeline_parallel_size=1, tensor_parallel_size=1, max_parallel_loading_workers=None, ray_workers_use_nsight=False, block_size=16, enable_prefix_caching=True, disable_sliding_window=False, use_v2_block_manager=True, num_lookahead_slots=0, seed=0, swap_space=4, cpu_offload_gb=0, gpu_memory_utilization=0.9, num_gpu_blocks_override=None, max_num_batched_tokens=None, max_num_seqs=512, max_logprobs=20, disable_log_stats=False, quantization=None, rope_scaling=None, rope_theta=None, hf_overrides=None, enforce_eager=False, max_seq_len_to_capture=8192, disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, limit_mm_per_prompt=None, mm_processor_kwargs=None, mm_cache_preprocessor=False, enable_lora=True, enable_lora_bias=False, max_loras=1, max_lora_rank=64, lora_extra_vocab_size=256, lora_dtype='bfloat16', long_lora_scaling_factors=None, max_cpu_loras=None, fully_sharded_loras=True, enable_prompt_adapter=False, max_prompt_adapters=1, max_prompt_adapter_token=0, device='auto', num_scheduler_steps=1, multi_step_stream_outputs=True, scheduler_delay_factor=0.0, enable_chunked_prefill=None, speculative_model=None, speculative_model_quantization=None, num_speculative_tokens=None, speculative_disable_mqa_scorer=False, speculative_draft_tensor_parallel_size=None, speculative_max_model_len=None, speculative_disable_by_batch_size=None, ngram_prompt_lookup_max=None, ngram_prompt_lookup_min=None, spec_decoding_acceptance_method='rejection_sampler', typical_acceptance_sampler_posterior_threshold=None, typical_acceptance_sampler_posterior_alpha=None, disable_logprobs_during_spec_decoding=None, model_loader_extra_config=None, ignore_patterns=[], preemption_mode=None, served_model_name=None, qlora_adapter_name_or_path=None, otlp_traces_endpoint=None, collect_detailed_traces=None, disable_async_output_proc=False, scheduling_policy='fcfs', override_neuron_config=None, override_pooler_config=None, compilation_config=None, kv_transfer_config=None, worker_cls='auto', calculate_kv_scales=False, disable_log_requests=False, max_log_len=None, 
disable_fastapi_docs=False, enable_prompt_tokens_details=False)
      INFO 03-14 02:36:58 api_server.py:198] Started engine process with PID 25
      INFO 2025-03-14 02:36:58,977 instructlab.model.backends.vllm:138: Waiting for the vLLM server to start at http://127.0.0.1:55565/v1, this might take a moment... Attempt: 3/120
      /opt/app-root/lib64/python3.11/site-packages/vllm/connections.py:8: RuntimeWarning: Failed to read commit hash:
      No module named 'vllm._version'
        from vllm.version import __version__ as VLLM_VERSION
      INFO 2025-03-14 02:37:02,409 instructlab.model.backends.vllm:138: Waiting for the vLLM server to start at http://127.0.0.1:55565/v1, this might take a moment... Attempt: 4/120
      INFO 03-14 02:37:03 config.py:444] This model supports multiple tasks: {'embed', 'score', 'reward', 'classify', 'generate'}. Defaulting to 'generate'.
      INFO 2025-03-14 02:37:05,682 instructlab.model.backends.vllm:138: Waiting for the vLLM server to start at http://127.0.0.1:55565/v1, this might take a moment... Attempt: 5/120
      INFO 03-14 02:37:06 config.py:444] This model supports multiple tasks: {'reward', 'classify', 'generate', 'score', 'embed'}. Defaulting to 'generate'.
      INFO 03-14 02:37:07 llm_engine.py:249] Initializing an LLM engine (v0.6.4.post1) with config: model='/mnt/instructlab/.cache/instructlab/models/mixtral-8x7b-instruct-v0-1', speculative_config=None, tokenizer='/mnt/instructlab/.cache/instructlab/models/mixtral-8x7b-instruct-v0-1', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=32768, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto,  device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='xgrammar'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=/mnt/instructlab/.cache/instructlab/models/mixtral-8x7b-instruct-v0-1, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=True, chunked_prefill_enabled=False, use_async_output_proc=True, mm_cache_preprocessor=False, mm_processor_kwargs=None, pooler_config=None, compilation_config={"candidate_compile_sizes":[],"compile_sizes":[],"capture_sizes":[512,504,496,488,480,472,464,456,448,440,432,424,416,408,400,392,384,376,368,360,352,344,336,328,320,312,304,296,288,280,272,264,256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"max_capture_size":512}, use_cached_outputs=True, 
      WARNING 03-14 02:37:07 multiproc_worker_utils.py:312] Reducing Torch parallelism from 96 threads to 1 to avoid unnecessary CPU contention. Set OMP_NUM_THREADS in the external environment to tune this value as needed.
      INFO 03-14 02:37:08 selector.py:134] Using ROCmFlashAttention backend.
      INFO 03-14 02:37:08 model_runner.py:1090] Starting to load model /mnt/instructlab/.cache/instructlab/models/mixtral-8x7b-instruct-v0-1...
      WARNING 03-14 02:37:08 registry.py:315] Model architecture 'MixtralForCausalLM' is partially supported by ROCm: Sliding window attention (SWA) is not yet supported in Triton flash attention. For half-precision SWA support, please use CK flash attention by setting `VLLM_USE_TRITON_FLASH_ATTN=0`
      Loading safetensors checkpoint shards:   0% Completed | 0/19 [00:00<?, ?it/s]
      INFO 2025-03-14 02:37:09,101 instructlab.model.backends.vllm:138: Waiting for the vLLM server to start at http://127.0.0.1:55565/v1, this might take a moment... Attempt: 6/120
      Loading safetensors checkpoint shards:   5% Completed | 1/19 [00:01<00:27,  1.52s/it]
      Loading safetensors checkpoint shards:  11% Completed | 2/19 [00:03<00:27,  1.60s/it]
      INFO 2025-03-14 02:37:12,316 instructlab.model.backends.vllm:138: Waiting for the vLLM server to start at http://127.0.0.1:55565/v1, this might take a moment... Attempt: 7/120
      Loading safetensors checkpoint shards:  16% Completed | 3/19 [00:04<00:26,  1.63s/it]
      Loading safetensors checkpoint shards:  21% Completed | 4/19 [00:06<00:24,  1.64s/it]
      INFO 2025-03-14 02:37:15,695 instructlab.model.backends.vllm:138: Waiting for the vLLM server to start at http://127.0.0.1:55565/v1, this might take a moment... Attempt: 8/120
      Loading safetensors checkpoint shards:  26% Completed | 5/19 [00:08<00:23,  1.66s/it]
      Loading safetensors checkpoint shards:  32% Completed | 6/19 [00:09<00:21,  1.68s/it]
      INFO 2025-03-14 02:37:19,140 instructlab.model.backends.vllm:138: Waiting for the vLLM server to start at http://127.0.0.1:55565/v1, this might take a moment... Attempt: 9/120
      Loading safetensors checkpoint shards:  37% Completed | 7/19 [00:11<00:20,  1.68s/it]
      Loading safetensors checkpoint shards:  42% Completed | 8/19 [00:13<00:18,  1.68s/it]
      INFO 2025-03-14 02:37:22,486 instructlab.model.backends.vllm:138: Waiting for the vLLM server to start at http://127.0.0.1:55565/v1, this might take a moment... Attempt: 10/120
      Loading safetensors checkpoint shards:  47% Completed | 9/19 [00:14<00:16,  1.61s/it]
      Loading safetensors checkpoint shards:  53% Completed | 10/19 [00:16<00:14,  1.63s/it]
      INFO 2025-03-14 02:37:25,671 instructlab.model.backends.vllm:138: Waiting for the vLLM server to start at http://127.0.0.1:55565/v1, this might take a moment... Attempt: 11/120
      Loading safetensors checkpoint shards:  58% Completed | 11/19 [00:18<00:13,  1.66s/it]
      Loading safetensors checkpoint shards:  63% Completed | 12/19 [00:19<00:11,  1.66s/it]
      INFO 2025-03-14 02:37:28,891 instructlab.model.backends.vllm:138: Waiting for the vLLM server to start at http://127.0.0.1:55565/v1, this might take a moment... Attempt: 12/120
      Loading safetensors checkpoint shards:  68% Completed | 13/19 [00:21<00:09,  1.67s/it]
      Loading safetensors checkpoint shards:  74% Completed | 14/19 [00:23<00:08,  1.66s/it]
      INFO 2025-03-14 02:37:32,264 instructlab.model.backends.vllm:138: Waiting for the vLLM server to start at http://127.0.0.1:55565/v1, this might take a moment... Attempt: 13/120
      Loading safetensors checkpoint shards:  79% Completed | 15/19 [00:24<00:06,  1.68s/it]
      Loading safetensors checkpoint shards:  84% Completed | 16/19 [00:26<00:05,  1.67s/it]
      INFO 2025-03-14 02:37:35,710 instructlab.model.backends.vllm:138: Waiting for the vLLM server to start at http://127.0.0.1:55565/v1, this might take a moment... Attempt: 14/120
      Loading safetensors checkpoint shards:  89% Completed | 17/19 [00:28<00:03,  1.68s/it]
      Loading safetensors checkpoint shards:  95% Completed | 18/19 [00:29<00:01,  1.68s/it]
      INFO 2025-03-14 02:37:38,883 instructlab.model.backends.vllm:138: Waiting for the vLLM server to start at http://127.0.0.1:55565/v1, this might take a moment... Attempt: 15/120
      Loading safetensors checkpoint shards: 100% Completed | 19/19 [00:31<00:00,  1.69s/it]
      Loading safetensors checkpoint shards: 100% Completed | 19/19 [00:31<00:00,  1.66s/it]
      
      INFO 03-14 02:37:40 model_runner.py:1095] Loading model weights took 87.0026 GB
      INFO 03-14 02:37:40 punica_selector.py:11] Using PunicaWrapperGPU.
      ERROR 03-14 02:37:40 engine.py:366] 
      Traceback (most recent call last):
        File "/opt/app-root/lib64/python3.11/site-packages/vllm/engine/multiprocessing/engine.py", line 357, in run_mp_engine
          engine = MQLLMEngine.from_engine_args(engine_args=engine_args,
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
        File "/opt/app-root/lib64/python3.11/site-packages/vllm/engine/multiprocessing/engine.py", line 119, in from_engine_args
          return cls(ipc_path=ipc_path,
                 ^^^^^^^^^^^^^^^^^^^^^^
        File "/opt/app-root/lib64/python3.11/site-packages/vllm/engine/multiprocessing/engine.py", line 71, in __init__
          self.engine = LLMEngine(*args, **kwargs)
                        ^^^^^^^^^^^^^^^^^^^^^^^^^^
        File "/opt/app-root/lib64/python3.11/site-packages/vllm/engine/llm_engine.py", line 288, in __init__
          self.model_executor = executor_class(vllm_config=vllm_config, )
                                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
        File "/opt/app-root/lib64/python3.11/site-packages/vllm/executor/distributed_gpu_executor.py", line 26, in __init__
          super().__init__(*args, **kwargs)
        File "/opt/app-root/lib64/python3.11/site-packages/vllm/executor/executor_base.py", line 36, in __init__
          self._init_executor()
        File "/opt/app-root/lib64/python3.11/site-packages/vllm/executor/multiproc_gpu_executor.py", line 83, in _init_executor
          self._run_workers("load_model",
        File "/opt/app-root/lib64/python3.11/site-packages/vllm/executor/multiproc_gpu_executor.py", line 157, in _run_workers
          driver_worker_output = driver_worker_method(*args, **kwargs)
                                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
        File "/opt/app-root/lib64/python3.11/site-packages/vllm/worker/worker.py", line 183, in load_model
          self.model_runner.load_model()
        File "/opt/app-root/lib64/python3.11/site-packages/vllm/worker/model_runner.py", line 1124, in load_model
          self.model = self.lora_manager.create_lora_manager(self.model)
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
        File "/opt/app-root/lib64/python3.11/site-packages/vllm/lora/worker_manager.py", line 174, in create_lora_manager
          lora_manager = create_lora_manager(
                         ^^^^^^^^^^^^^^^^^^^^
        File "/opt/app-root/lib64/python3.11/site-packages/vllm/lora/models.py", line 755, in create_lora_manager
          lora_manager = lora_manager_cls(
                         ^^^^^^^^^^^^^^^^^
        File "/opt/app-root/lib64/python3.11/site-packages/vllm/lora/models.py", line 678, in __init__
          super().__init__(model, max_num_seqs, max_num_batched_tokens,
        File "/opt/app-root/lib64/python3.11/site-packages/vllm/lora/models.py", line 353, in __init__
          self._create_lora_modules()
        File "/opt/app-root/lib64/python3.11/site-packages/vllm/lora/models.py", line 507, in _create_lora_modules
          self.register_module(module_name, new_module)
        File "/opt/app-root/lib64/python3.11/site-packages/vllm/lora/models.py", line 513, in register_module
          assert isinstance(module, BaseLayerWithLoRA)
      AssertionError
      Process SpawnProcess-1:
      Traceback (most recent call last):
        File "/usr/lib64/python3.11/multiprocessing/process.py", line 314, in _bootstrap
          self.run()
        File "/usr/lib64/python3.11/multiprocessing/process.py", line 108, in run
          self._target(*self._args, **self._kwargs)
        File "/opt/app-root/lib64/python3.11/site-packages/vllm/engine/multiprocessing/engine.py", line 368, in run_mp_engine
          raise e
        File "/opt/app-root/lib64/python3.11/site-packages/vllm/engine/multiprocessing/engine.py", line 357, in run_mp_engine
          engine = MQLLMEngine.from_engine_args(engine_args=engine_args,
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
        File "/opt/app-root/lib64/python3.11/site-packages/vllm/engine/multiprocessing/engine.py", line 119, in from_engine_args
          return cls(ipc_path=ipc_path,
                 ^^^^^^^^^^^^^^^^^^^^^^
        File "/opt/app-root/lib64/python3.11/site-packages/vllm/engine/multiprocessing/engine.py", line 71, in __init__
          self.engine = LLMEngine(*args, **kwargs)
                        ^^^^^^^^^^^^^^^^^^^^^^^^^^
        File "/opt/app-root/lib64/python3.11/site-packages/vllm/engine/llm_engine.py", line 288, in __init__
          self.model_executor = executor_class(vllm_config=vllm_config, )
                                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
        File "/opt/app-root/lib64/python3.11/site-packages/vllm/executor/distributed_gpu_executor.py", line 26, in __init__
          super().__init__(*args, **kwargs)
        File "/opt/app-root/lib64/python3.11/site-packages/vllm/executor/executor_base.py", line 36, in __init__
          self._init_executor()
        File "/opt/app-root/lib64/python3.11/site-packages/vllm/executor/multiproc_gpu_executor.py", line 83, in _init_executor
          self._run_workers("load_model",
        File "/opt/app-root/lib64/python3.11/site-packages/vllm/executor/multiproc_gpu_executor.py", line 157, in _run_workers
          driver_worker_output = driver_worker_method(*args, **kwargs)
                                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
        File "/opt/app-root/lib64/python3.11/site-packages/vllm/worker/worker.py", line 183, in load_model
          self.model_runner.load_model()
        File "/opt/app-root/lib64/python3.11/site-packages/vllm/worker/model_runner.py", line 1124, in load_model
          self.model = self.lora_manager.create_lora_manager(self.model)
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
        File "/opt/app-root/lib64/python3.11/site-packages/vllm/lora/worker_manager.py", line 174, in create_lora_manager
          lora_manager = create_lora_manager(
                         ^^^^^^^^^^^^^^^^^^^^
        File "/opt/app-root/lib64/python3.11/site-packages/vllm/lora/models.py", line 755, in create_lora_manager
          lora_manager = lora_manager_cls(
                         ^^^^^^^^^^^^^^^^^
        File "/opt/app-root/lib64/python3.11/site-packages/vllm/lora/models.py", line 678, in __init__
          super().__init__(model, max_num_seqs, max_num_batched_tokens,
        File "/opt/app-root/lib64/python3.11/site-packages/vllm/lora/models.py", line 353, in __init__
          self._create_lora_modules()
        File "/opt/app-root/lib64/python3.11/site-packages/vllm/lora/models.py", line 507, in _create_lora_modules
          self.register_module(module_name, new_module)
        File "/opt/app-root/lib64/python3.11/site-packages/vllm/lora/models.py", line 513, in register_module
          assert isinstance(module, BaseLayerWithLoRA)
      AssertionError
      INFO 2025-03-14 02:37:42,149 instructlab.model.backends.vllm:138: Waiting for the vLLM server to start at http://127.0.0.1:55565/v1, this might take a moment... Attempt: 16/120
      Task exception was never retrieved
      future: <Task finished name='Task-2' coro=<MQLLMEngineClient.run_output_handler_loop() done, defined at /opt/app-root/lib64/python3.11/site-packages/vllm/engine/multiprocessing/client.py:178> exception=ZMQError('Operation not supported')>
      Traceback (most recent call last):
        File "/opt/app-root/lib64/python3.11/site-packages/vllm/engine/multiprocessing/client.py", line 184, in run_output_handler_loop
          while await self.output_socket.poll(timeout=VLLM_RPC_TIMEOUT
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
        File "/opt/app-root/lib64/python3.11/site-packages/zmq/_future.py", line 372, in poll
          raise _zmq.ZMQError(_zmq.ENOTSUP)
      zmq.error.ZMQError: Operation not supported
      Task exception was never retrieved
      future: <Task finished name='Task-3' coro=<MQLLMEngineClient.run_output_handler_loop() done, defined at /opt/app-root/lib64/python3.11/site-packages/vllm/engine/multiprocessing/client.py:178> exception=ZMQError('Operation not supported')>
      Traceback (most recent call last):
        File "/opt/app-root/lib64/python3.11/site-packages/vllm/engine/multiprocessing/client.py", line 184, in run_output_handler_loop
          while await self.output_socket.poll(timeout=VLLM_RPC_TIMEOUT
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
        File "/opt/app-root/lib64/python3.11/site-packages/zmq/_future.py", line 372, in poll
          raise _zmq.ZMQError(_zmq.ENOTSUP)
      zmq.error.ZMQError: Operation not supported
      Task exception was never retrieved
      future: <Task finished name='Task-4' coro=<MQLLMEngineClient.run_output_handler_loop() done, defined at /opt/app-root/lib64/python3.11/site-packages/vllm/engine/multiprocessing/client.py:178> exception=ZMQError('Operation not supported')>
      Traceback (most recent call last):
        File "/opt/app-root/lib64/python3.11/site-packages/vllm/engine/multiprocessing/client.py", line 184, in run_output_handler_loop
          while await self.output_socket.poll(timeout=VLLM_RPC_TIMEOUT
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
        File "/opt/app-root/lib64/python3.11/site-packages/zmq/_future.py", line 372, in poll
          raise _zmq.ZMQError(_zmq.ENOTSUP)
      zmq.error.ZMQError: Operation not supported
      Traceback (most recent call last):
        File "<frozen runpy>", line 198, in _run_module_as_main
        File "<frozen runpy>", line 88, in _run_code
        File "/opt/app-root/lib64/python3.11/site-packages/vllm/entrypoints/openai/api_server.py", line 701, in <module>
          uvloop.run(run_server(args))
        File "/opt/app-root/lib64/python3.11/site-packages/uvloop/__init__.py", line 105, in run
          return runner.run(wrapper())
                 ^^^^^^^^^^^^^^^^^^^^^
        File "/usr/lib64/python3.11/asyncio/runners.py", line 118, in run
          return self._loop.run_until_complete(task)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
        File "uvloop/loop.pyx", line 1518, in uvloop.loop.Loop.run_until_complete
        File "/opt/app-root/lib64/python3.11/site-packages/uvloop/__init__.py", line 61, in wrapper
          return await main
                 ^^^^^^^^^^
        File "/opt/app-root/lib64/python3.11/site-packages/vllm/entrypoints/openai/api_server.py", line 667, in run_server
          async with build_async_engine_client(args) as engine_client:
        File "/usr/lib64/python3.11/contextlib.py", line 210, in __aenter__
          return await anext(self.gen)
                 ^^^^^^^^^^^^^^^^^^^^^
        File "/opt/app-root/lib64/python3.11/site-packages/vllm/entrypoints/openai/api_server.py", line 117, in build_async_engine_client
          async with build_async_engine_client_from_engine_args(
        File "/usr/lib64/python3.11/contextlib.py", line 210, in __aenter__
          return await anext(self.gen)
                 ^^^^^^^^^^^^^^^^^^^^^
        File "/opt/app-root/lib64/python3.11/site-packages/vllm/entrypoints/openai/api_server.py", line 222, in build_async_engine_client_from_engine_args
          raise RuntimeError(
      RuntimeError: Engine process failed to start. See stack trace for the root cause.
      INFO 2025-03-14 02:37:45,380 instructlab.model.backends.vllm:138: Waiting for the vLLM server to start at http://127.0.0.1:55565/v1, this might take a moment... Attempt: 17/120
      INFO 2025-03-14 02:37:48,751 instructlab.model.backends.vllm:180: vLLM startup failed.  Retrying (1/1)
      ERROR 2025-03-14 02:37:48,751 instructlab.model.backends.vllm:185: vLLM failed to start.
      INFO 2025-03-14 02:37:48,751 instructlab.model.backends.vllm:115: Trying to connect to model server at http://127.0.0.1:8000/v1
      INFO 2025-03-14 02:37:50,130 instructlab.model.backends.vllm:332: vLLM starting up on pid 179 at http://127.0.0.1:60397/v1
      INFO 2025-03-14 02:37:50,131 instructlab.model.backends.vllm:123: Starting a temporary vLLM server at http://127.0.0.1:60397/v1
      INFO 2025-03-14 02:37:50,131 instructlab.model.backends.vllm:138: Waiting for the vLLM server to start at http://127.0.0.1:60397/v1, this might take a moment... Attempt: 1/120
      WARNING 03-14 02:37:52 rocm.py:34] `fork` method is not supported by ROCm. VLLM_WORKER_MULTIPROC_METHOD is overridden to `spawn` instead.
      /opt/app-root/lib64/python3.11/site-packages/vllm/connections.py:8: RuntimeWarning: Failed to read commit hash:
      No module named 'vllm._version'
        from vllm.version import __version__ as VLLM_VERSION
      INFO 2025-03-14 02:37:53,334 instructlab.model.backends.vllm:138: Waiting for the vLLM server to start at http://127.0.0.1:60397/v1, this might take a moment... Attempt: 2/120
      INFO 03-14 02:37:53 api_server.py:643] vLLM API server version 0.6.4.post1
      INFO 03-14 02:37:53 api_server.py:644] args: Namespace(host='127.0.0.1', port=60397, uvicorn_log_level='info', allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, lora_modules=[LoRAModulePath(name='skill-classifier-v3-clm', path='/mnt/instructlab/.cache/instructlab/models/skills-adapter-v3', base_model_name=None), LoRAModulePath(name='text-classifier-knowledge-v3-clm', path='/mnt/instructlab/.cache/instructlab/models/knowledge-adapter-v3', base_model_name=None)], prompt_adapters=None, chat_template=None, chat_template_content_format='auto', response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, ssl_cert_reqs=0, root_path=None, middleware=[], return_tokens_as_token_ids=False, disable_frontend_multiprocessing=False, enable_auto_tool_choice=False, tool_call_parser=None, tool_parser_plugin='', model='/mnt/instructlab/.cache/instructlab/models/mixtral-8x7b-instruct-v0-1', task='auto', tokenizer=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=False, allowed_local_media_path=None, download_dir=None, load_format='auto', config_format=<ConfigFormat.AUTO: 'auto'>, dtype='bfloat16', kv_cache_dtype='auto', max_model_len=None, guided_decoding_backend='xgrammar', logits_processor_pattern=None, distributed_executor_backend='mp', worker_use_ray=False, pipeline_parallel_size=1, tensor_parallel_size=1, max_parallel_loading_workers=None, ray_workers_use_nsight=False, block_size=16, enable_prefix_caching=True, disable_sliding_window=False, use_v2_block_manager=True, num_lookahead_slots=0, seed=0, swap_space=4, cpu_offload_gb=0, gpu_memory_utilization=0.9, num_gpu_blocks_override=None, max_num_batched_tokens=None, max_num_seqs=512, max_logprobs=20, disable_log_stats=False, quantization=None, rope_scaling=None, rope_theta=None, hf_overrides=None, enforce_eager=False, max_seq_len_to_capture=8192, disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, limit_mm_per_prompt=None, mm_processor_kwargs=None, mm_cache_preprocessor=False, enable_lora=True, enable_lora_bias=False, max_loras=1, max_lora_rank=64, lora_extra_vocab_size=256, lora_dtype='bfloat16', long_lora_scaling_factors=None, max_cpu_loras=None, fully_sharded_loras=True, enable_prompt_adapter=False, max_prompt_adapters=1, max_prompt_adapter_token=0, device='auto', num_scheduler_steps=1, multi_step_stream_outputs=True, scheduler_delay_factor=0.0, enable_chunked_prefill=None, speculative_model=None, speculative_model_quantization=None, num_speculative_tokens=None, speculative_disable_mqa_scorer=False, speculative_draft_tensor_parallel_size=None, speculative_max_model_len=None, speculative_disable_by_batch_size=None, ngram_prompt_lookup_max=None, ngram_prompt_lookup_min=None, spec_decoding_acceptance_method='rejection_sampler', typical_acceptance_sampler_posterior_threshold=None, typical_acceptance_sampler_posterior_alpha=None, disable_logprobs_during_spec_decoding=None, model_loader_extra_config=None, ignore_patterns=[], preemption_mode=None, served_model_name=None, qlora_adapter_name_or_path=None, otlp_traces_endpoint=None, collect_detailed_traces=None, disable_async_output_proc=False, scheduling_policy='fcfs', override_neuron_config=None, override_pooler_config=None, compilation_config=None, kv_transfer_config=None, worker_cls='auto', calculate_kv_scales=False, disable_log_requests=False, max_log_len=None, 
disable_fastapi_docs=False, enable_prompt_tokens_details=False)
      INFO 03-14 02:37:53 api_server.py:198] Started engine process with PID 199
      /opt/app-root/lib64/python3.11/site-packages/vllm/connections.py:8: RuntimeWarning: Failed to read commit hash:
      No module named 'vllm._version'
        from vllm.version import __version__ as VLLM_VERSION
      INFO 2025-03-14 02:37:56,792 instructlab.model.backends.vllm:138: Waiting for the vLLM server to start at http://127.0.0.1:60397/v1, this might take a moment... Attempt: 3/120
      INFO 03-14 02:37:58 config.py:444] This model supports multiple tasks: {'score', 'classify', 'generate', 'reward', 'embed'}. Defaulting to 'generate'.
      INFO 2025-03-14 02:38:00,171 instructlab.model.backends.vllm:138: Waiting for the vLLM server to start at http://127.0.0.1:60397/v1, this might take a moment... Attempt: 4/120
      INFO 03-14 02:38:01 config.py:444] This model supports multiple tasks: {'score', 'reward', 'embed', 'classify', 'generate'}. Defaulting to 'generate'.
      INFO 03-14 02:38:03 llm_engine.py:249] Initializing an LLM engine (v0.6.4.post1) with config: model='/mnt/instructlab/.cache/instructlab/models/mixtral-8x7b-instruct-v0-1', speculative_config=None, tokenizer='/mnt/instructlab/.cache/instructlab/models/mixtral-8x7b-instruct-v0-1', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=32768, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto,  device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='xgrammar'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=/mnt/instructlab/.cache/instructlab/models/mixtral-8x7b-instruct-v0-1, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=True, chunked_prefill_enabled=False, use_async_output_proc=True, mm_cache_preprocessor=False, mm_processor_kwargs=None, pooler_config=None, compilation_config={"candidate_compile_sizes":[],"compile_sizes":[],"capture_sizes":[512,504,496,488,480,472,464,456,448,440,432,424,416,408,400,392,384,376,368,360,352,344,336,328,320,312,304,296,288,280,272,264,256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"max_capture_size":512}, use_cached_outputs=True, 
      WARNING 03-14 02:38:03 multiproc_worker_utils.py:312] Reducing Torch parallelism from 96 threads to 1 to avoid unnecessary CPU contention. Set OMP_NUM_THREADS in the external environment to tune this value as needed.
      INFO 03-14 02:38:03 selector.py:134] Using ROCmFlashAttention backend.
      INFO 2025-03-14 02:38:03,499 instructlab.model.backends.vllm:138: Waiting for the vLLM server to start at http://127.0.0.1:60397/v1, this might take a moment... Attempt: 5/120
      INFO 03-14 02:38:03 model_runner.py:1090] Starting to load model /mnt/instructlab/.cache/instructlab/models/mixtral-8x7b-instruct-v0-1...
      WARNING 03-14 02:38:03 registry.py:315] Model architecture 'MixtralForCausalLM' is partially supported by ROCm: Sliding window attention (SWA) is not yet supported in Triton flash attention. For half-precision SWA support, please use CK flash attention by setting `VLLM_USE_TRITON_FLASH_ATTN=0`
      Loading safetensors checkpoint shards:   0% Completed | 0/19 [00:00<?, ?it/s]
      Loading safetensors checkpoint shards:   5% Completed | 1/19 [00:01<00:24,  1.34s/it]
      Loading safetensors checkpoint shards:  11% Completed | 2/19 [00:02<00:23,  1.41s/it]
      INFO 2025-03-14 02:38:06,664 instructlab.model.backends.vllm:138: Waiting for the vLLM server to start at http://127.0.0.1:60397/v1, this might take a moment... Attempt: 6/120
      Loading safetensors checkpoint shards:  16% Completed | 3/19 [00:04<00:22,  1.43s/it]
      Loading safetensors checkpoint shards:  21% Completed | 4/19 [00:05<00:22,  1.47s/it]
      INFO 2025-03-14 02:38:09,813 instructlab.model.backends.vllm:138: Waiting for the vLLM server to start at http://127.0.0.1:60397/v1, this might take a moment... Attempt: 7/120
      Loading safetensors checkpoint shards:  26% Completed | 5/19 [00:07<00:20,  1.49s/it]
      Loading safetensors checkpoint shards:  32% Completed | 6/19 [00:08<00:19,  1.50s/it]
      INFO 2025-03-14 02:38:13,108 instructlab.model.backends.vllm:138: Waiting for the vLLM server to start at http://127.0.0.1:60397/v1, this might take a moment... Attempt: 8/120
      Loading safetensors checkpoint shards:  37% Completed | 7/19 [00:10<00:18,  1.51s/it]
      Loading safetensors checkpoint shards:  42% Completed | 8/19 [00:11<00:16,  1.50s/it]
      INFO 2025-03-14 02:38:16,440 instructlab.model.backends.vllm:138: Waiting for the vLLM server to start at http://127.0.0.1:60397/v1, this might take a moment... Attempt: 9/120
      Loading safetensors checkpoint shards:  47% Completed | 9/19 [00:13<00:14,  1.44s/it]
      Loading safetensors checkpoint shards:  53% Completed | 10/19 [00:14<00:13,  1.47s/it]
      INFO 2025-03-14 02:38:19,645 instructlab.model.backends.vllm:138: Waiting for the vLLM server to start at http://127.0.0.1:60397/v1, this might take a moment... Attempt: 10/120
      Loading safetensors checkpoint shards:  58% Completed | 11/19 [00:16<00:11,  1.49s/it]
      Loading safetensors checkpoint shards:  63% Completed | 12/19 [00:17<00:10,  1.50s/it]
      INFO 2025-03-14 02:38:22,838 instructlab.model.backends.vllm:138: Waiting for the vLLM server to start at http://127.0.0.1:60397/v1, this might take a moment... Attempt: 11/120
      Loading safetensors checkpoint shards:  68% Completed | 13/19 [00:19<00:09,  1.50s/it]
      Loading safetensors checkpoint shards:  74% Completed | 14/19 [00:20<00:07,  1.51s/it]
      Loading safetensors checkpoint shards:  79% Completed | 15/19 [00:22<00:06,  1.50s/it]
      INFO 2025-03-14 02:38:26,323 instructlab.model.backends.vllm:138: Waiting for the vLLM server to start at http://127.0.0.1:60397/v1, this might take a moment... Attempt: 12/120
      Loading safetensors checkpoint shards:  84% Completed | 16/19 [00:23<00:04,  1.50s/it]
      Loading safetensors checkpoint shards:  89% Completed | 17/19 [00:25<00:02,  1.50s/it]
      INFO 2025-03-14 02:38:29,657 instructlab.model.backends.vllm:138: Waiting for the vLLM server to start at http://127.0.0.1:60397/v1, this might take a moment... Attempt: 13/120
      Loading safetensors checkpoint shards:  95% Completed | 18/19 [00:26<00:01,  1.50s/it]
      Loading safetensors checkpoint shards: 100% Completed | 19/19 [00:28<00:00,  1.52s/it]
      Loading safetensors checkpoint shards: 100% Completed | 19/19 [00:28<00:00,  1.49s/it]
      
      INFO 03-14 02:38:32 model_runner.py:1095] Loading model weights took 87.0026 GB
      INFO 03-14 02:38:32 punica_selector.py:11] Using PunicaWrapperGPU.
      ERROR 03-14 02:38:32 engine.py:366] 
      Traceback (most recent call last):
        File "/opt/app-root/lib64/python3.11/site-packages/vllm/engine/multiprocessing/engine.py", line 357, in run_mp_engine
          engine = MQLLMEngine.from_engine_args(engine_args=engine_args,
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
        File "/opt/app-root/lib64/python3.11/site-packages/vllm/engine/multiprocessing/engine.py", line 119, in from_engine_args
          return cls(ipc_path=ipc_path,
                 ^^^^^^^^^^^^^^^^^^^^^^
        File "/opt/app-root/lib64/python3.11/site-packages/vllm/engine/multiprocessing/engine.py", line 71, in __init__
          self.engine = LLMEngine(*args, **kwargs)
                        ^^^^^^^^^^^^^^^^^^^^^^^^^^
        File "/opt/app-root/lib64/python3.11/site-packages/vllm/engine/llm_engine.py", line 288, in __init__
          self.model_executor = executor_class(vllm_config=vllm_config, )
                                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
        File "/opt/app-root/lib64/python3.11/site-packages/vllm/executor/distributed_gpu_executor.py", line 26, in __init__
          super().__init__(*args, **kwargs)
        File "/opt/app-root/lib64/python3.11/site-packages/vllm/executor/executor_base.py", line 36, in __init__
          self._init_executor()
        File "/opt/app-root/lib64/python3.11/site-packages/vllm/executor/multiproc_gpu_executor.py", line 83, in _init_executor
          self._run_workers("load_model",
        File "/opt/app-root/lib64/python3.11/site-packages/vllm/executor/multiproc_gpu_executor.py", line 157, in _run_workers
          driver_worker_output = driver_worker_method(*args, **kwargs)
                                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
        File "/opt/app-root/lib64/python3.11/site-packages/vllm/worker/worker.py", line 183, in load_model
          self.model_runner.load_model()
        File "/opt/app-root/lib64/python3.11/site-packages/vllm/worker/model_runner.py", line 1124, in load_model
          self.model = self.lora_manager.create_lora_manager(self.model)
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
        File "/opt/app-root/lib64/python3.11/site-packages/vllm/lora/worker_manager.py", line 174, in create_lora_manager
          lora_manager = create_lora_manager(
                         ^^^^^^^^^^^^^^^^^^^^
        File "/opt/app-root/lib64/python3.11/site-packages/vllm/lora/models.py", line 755, in create_lora_manager
          lora_manager = lora_manager_cls(
                         ^^^^^^^^^^^^^^^^^
        File "/opt/app-root/lib64/python3.11/site-packages/vllm/lora/models.py", line 678, in __init__
          super().__init__(model, max_num_seqs, max_num_batched_tokens,
        File "/opt/app-root/lib64/python3.11/site-packages/vllm/lora/models.py", line 353, in __init__
          self._create_lora_modules()
        File "/opt/app-root/lib64/python3.11/site-packages/vllm/lora/models.py", line 507, in _create_lora_modules
          self.register_module(module_name, new_module)
        File "/opt/app-root/lib64/python3.11/site-packages/vllm/lora/models.py", line 513, in register_module
          assert isinstance(module, BaseLayerWithLoRA)
      AssertionError
      Process SpawnProcess-1:
      Traceback (most recent call last):
        File "/usr/lib64/python3.11/multiprocessing/process.py", line 314, in _bootstrap
          self.run()
        File "/usr/lib64/python3.11/multiprocessing/process.py", line 108, in run
          self._target(*self._args, **self._kwargs)
        File "/opt/app-root/lib64/python3.11/site-packages/vllm/engine/multiprocessing/engine.py", line 368, in run_mp_engine
          raise e
        File "/opt/app-root/lib64/python3.11/site-packages/vllm/engine/multiprocessing/engine.py", line 357, in run_mp_engine
          engine = MQLLMEngine.from_engine_args(engine_args=engine_args,
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
        File "/opt/app-root/lib64/python3.11/site-packages/vllm/engine/multiprocessing/engine.py", line 119, in from_engine_args
          return cls(ipc_path=ipc_path,
                 ^^^^^^^^^^^^^^^^^^^^^^
        File "/opt/app-root/lib64/python3.11/site-packages/vllm/engine/multiprocessing/engine.py", line 71, in __init__
          self.engine = LLMEngine(*args, **kwargs)
                        ^^^^^^^^^^^^^^^^^^^^^^^^^^
        File "/opt/app-root/lib64/python3.11/site-packages/vllm/engine/llm_engine.py", line 288, in __init__
          self.model_executor = executor_class(vllm_config=vllm_config, )
                                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
        File "/opt/app-root/lib64/python3.11/site-packages/vllm/executor/distributed_gpu_executor.py", line 26, in __init__
          super().__init__(*args, **kwargs)
        File "/opt/app-root/lib64/python3.11/site-packages/vllm/executor/executor_base.py", line 36, in __init__
          self._init_executor()
        File "/opt/app-root/lib64/python3.11/site-packages/vllm/executor/multiproc_gpu_executor.py", line 83, in _init_executor
          self._run_workers("load_model",
        File "/opt/app-root/lib64/python3.11/site-packages/vllm/executor/multiproc_gpu_executor.py", line 157, in _run_workers
          driver_worker_output = driver_worker_method(*args, **kwargs)
                                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
        File "/opt/app-root/lib64/python3.11/site-packages/vllm/worker/worker.py", line 183, in load_model
          self.model_runner.load_model()
        File "/opt/app-root/lib64/python3.11/site-packages/vllm/worker/model_runner.py", line 1124, in load_model
          self.model = self.lora_manager.create_lora_manager(self.model)
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
        File "/opt/app-root/lib64/python3.11/site-packages/vllm/lora/worker_manager.py", line 174, in create_lora_manager
          lora_manager = create_lora_manager(
                         ^^^^^^^^^^^^^^^^^^^^
        File "/opt/app-root/lib64/python3.11/site-packages/vllm/lora/models.py", line 755, in create_lora_manager
          lora_manager = lora_manager_cls(
                         ^^^^^^^^^^^^^^^^^
        File "/opt/app-root/lib64/python3.11/site-packages/vllm/lora/models.py", line 678, in __init__
          super().__init__(model, max_num_seqs, max_num_batched_tokens,
        File "/opt/app-root/lib64/python3.11/site-packages/vllm/lora/models.py", line 353, in __init__
          self._create_lora_modules()
        File "/opt/app-root/lib64/python3.11/site-packages/vllm/lora/models.py", line 507, in _create_lora_modules
          self.register_module(module_name, new_module)
        File "/opt/app-root/lib64/python3.11/site-packages/vllm/lora/models.py", line 513, in register_module
          assert isinstance(module, BaseLayerWithLoRA)
      AssertionError
      INFO 2025-03-14 02:38:32,800 instructlab.model.backends.vllm:138: Waiting for the vLLM server to start at http://127.0.0.1:60397/v1, this might take a moment... Attempt: 14/120
      INFO 2025-03-14 02:38:36,098 instructlab.model.backends.vllm:138: Waiting for the vLLM server to start at http://127.0.0.1:60397/v1, this might take a moment... Attempt: 15/120
      INFO 2025-03-14 02:38:39,498 instructlab.model.backends.vllm:138: Waiting for the vLLM server to start at http://127.0.0.1:60397/v1, this might take a moment... Attempt: 16/120
      Task exception was never retrieved
      future: <Task finished name='Task-2' coro=<MQLLMEngineClient.run_output_handler_loop() done, defined at /opt/app-root/lib64/python3.11/site-packages/vllm/engine/multiprocessing/client.py:178> exception=ZMQError('Operation not supported')>
      Traceback (most recent call last):
        File "/opt/app-root/lib64/python3.11/site-packages/vllm/engine/multiprocessing/client.py", line 184, in run_output_handler_loop
          while await self.output_socket.poll(timeout=VLLM_RPC_TIMEOUT
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
        File "/opt/app-root/lib64/python3.11/site-packages/zmq/_future.py", line 372, in poll
          raise _zmq.ZMQError(_zmq.ENOTSUP)
      zmq.error.ZMQError: Operation not supported
      Task exception was never retrieved
      future: <Task finished name='Task-3' coro=<MQLLMEngineClient.run_output_handler_loop() done, defined at /opt/app-root/lib64/python3.11/site-packages/vllm/engine/multiprocessing/client.py:178> exception=ZMQError('Operation not supported')>
      Traceback (most recent call last):
        File "/opt/app-root/lib64/python3.11/site-packages/vllm/engine/multiprocessing/client.py", line 184, in run_output_handler_loop
          while await self.output_socket.poll(timeout=VLLM_RPC_TIMEOUT
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
        File "/opt/app-root/lib64/python3.11/site-packages/zmq/_future.py", line 372, in poll
          raise _zmq.ZMQError(_zmq.ENOTSUP)
      zmq.error.ZMQError: Operation not supported
      Task exception was never retrieved
      future: <Task finished name='Task-4' coro=<MQLLMEngineClient.run_output_handler_loop() done, defined at /opt/app-root/lib64/python3.11/site-packages/vllm/engine/multiprocessing/client.py:178> exception=ZMQError('Operation not supported')>
      Traceback (most recent call last):
        File "/opt/app-root/lib64/python3.11/site-packages/vllm/engine/multiprocessing/client.py", line 184, in run_output_handler_loop
          while await self.output_socket.poll(timeout=VLLM_RPC_TIMEOUT
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
        File "/opt/app-root/lib64/python3.11/site-packages/zmq/_future.py", line 372, in poll
          raise _zmq.ZMQError(_zmq.ENOTSUP)
      zmq.error.ZMQError: Operation not supported
      Traceback (most recent call last):
        File "<frozen runpy>", line 198, in _run_module_as_main
        File "<frozen runpy>", line 88, in _run_code
        File "/opt/app-root/lib64/python3.11/site-packages/vllm/entrypoints/openai/api_server.py", line 701, in <module>
          uvloop.run(run_server(args))
        File "/opt/app-root/lib64/python3.11/site-packages/uvloop/__init__.py", line 105, in run
          return runner.run(wrapper())
                 ^^^^^^^^^^^^^^^^^^^^^
        File "/usr/lib64/python3.11/asyncio/runners.py", line 118, in run
          return self._loop.run_until_complete(task)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
        File "uvloop/loop.pyx", line 1518, in uvloop.loop.Loop.run_until_complete
        File "/opt/app-root/lib64/python3.11/site-packages/uvloop/__init__.py", line 61, in wrapper
          return await main
                 ^^^^^^^^^^
        File "/opt/app-root/lib64/python3.11/site-packages/vllm/entrypoints/openai/api_server.py", line 667, in run_server
          async with build_async_engine_client(args) as engine_client:
        File "/usr/lib64/python3.11/contextlib.py", line 210, in __aenter__
          return await anext(self.gen)
                 ^^^^^^^^^^^^^^^^^^^^^
        File "/opt/app-root/lib64/python3.11/site-packages/vllm/entrypoints/openai/api_server.py", line 117, in build_async_engine_client
          async with build_async_engine_client_from_engine_args(
        File "/usr/lib64/python3.11/contextlib.py", line 210, in __aenter__
          return await anext(self.gen)
                 ^^^^^^^^^^^^^^^^^^^^^
        File "/opt/app-root/lib64/python3.11/site-packages/vllm/entrypoints/openai/api_server.py", line 222, in build_async_engine_client_from_engine_args
          raise RuntimeError(
      RuntimeError: Engine process failed to start. See stack trace for the root cause.
      failed to generate data with exception: Failed to start server: vLLM failed to start.
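
      For reference, both startup attempts fail on the same assertion in vLLM's LoRA setup (vllm/lora/models.py, line 513, in register_module). The snippet below is an illustration only, not vLLM source; it just shows the shape of the check that fires. The LoRA manager registers each replacement module it creates and asserts that it is a BaseLayerWithLoRA; apparently one of the Mixtral layers on this ROCm build is registered without having been wrapped into a LoRA-capable layer, which raises the AssertionError seen above.

      # Illustration only (hypothetical, simplified): the isinstance check that fails.
      class BaseLayerWithLoRA:
          """Marker base for layers that have been wrapped with LoRA support."""

      class PlainLinear:
          """Stand-in for a layer that was never replaced with a LoRA-capable variant."""

      class LoRAModelManager:
          def __init__(self):
              self.modules = {}

          def register_module(self, module_name, module):
              # Mirrors the assert at vllm/lora/models.py:513: every registered
              # module must be a LoRA-capable layer, otherwise engine startup aborts.
              assert isinstance(module, BaseLayerWithLoRA)
              self.modules[module_name] = module

      LoRAModelManager().register_module("model.layers.0.some_proj", PlainLinear())  # raises AssertionError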
      

      Expected behavior

      • `ilab data generate` should start the vLLM teacher server and complete data generation.

      Screenshots

      • Attached Image

      Device Info

      • Hardware Specs: AMD MI300X x8
      • OS Version: Red Hat Enterprise Linux 9.4 (Plow)
      • Image: "registry.stage.redhat.io/rhelai1/bootc-azure-amd-rhel9:1.4"
      • InstructLab Version (output of `ilab --version`):

        [cloud-user@lab-mi300x cloud-user]$ ilab --version
        ilab, version 0.23.3

      • Output of `ilab system info` (InstructLab version, OS, and hardware, including GPU / AI accelerator details):

      Platform:
      sys.version: 3.11.7 (main, Jan 8 2025, 00:00:00) [GCC 11.4.1 20231218 (Red Hat 11.4.1-3)]
      sys.platform: linux
      os.name: posix
      platform.release: 5.14.0-427.55.1.el9_4.x86_64
      platform.machine: x86_64
      platform.node: lab-mi300x
      platform.python_version: 3.11.7
      os-release.ID: rhel
      os-release.VERSION_ID: 9.4
      os-release.PRETTY_NAME: Red Hat Enterprise Linux 9.4 (Plow)
      memory.total: 1820.96 GB
      memory.available: 1782.84 GB
      memory.used: 29.43 GB

      InstructLab:
      instructlab.version: 0.23.3
      instructlab-dolomite.version: 0.2.0
      instructlab-eval.version: 0.5.1
      instructlab-quantize.version: 0.1.0
      instructlab-schema.version: 0.4.2
      instructlab-sdg.version: 0.7.1
      instructlab-training.version: 0.7.0

      Torch:
      torch.version: 2.4.1
      torch.backends.cpu.capability: AVX512
      torch.version.cuda: None
      torch.version.hip: 6.2.41134-65d174c3e
      torch.cuda.available: True
      torch.backends.cuda.is_built: True
      torch.backends.mps.is_built: False
      torch.backends.mps.is_available: False
      torch.cuda.bf16: True
      torch.cuda.current.device: 0
      torch.cuda.0.name: AMD Radeon Graphics
      torch.cuda.0.free: 191.0 GB
      torch.cuda.0.total: 191.5 GB
      torch.cuda.0.capability: 9.4 (see https://developer.nvidia.com/cuda-gpus#compute)
      torch.cuda.1.name: AMD Radeon Graphics
      torch.cuda.1.free: 191.0 GB
      torch.cuda.1.total: 191.5 GB
      torch.cuda.1.capability: 9.4 (see https://developer.nvidia.com/cuda-gpus#compute)
      torch.cuda.2.name: AMD Radeon Graphics
      torch.cuda.2.free: 191.0 GB
      torch.cuda.2.total: 191.5 GB
      torch.cuda.2.capability: 9.4 (see https://developer.nvidia.com/cuda-gpus#compute)
      torch.cuda.3.name: AMD Radeon Graphics
      torch.cuda.3.free: 191.0 GB
      torch.cuda.3.total: 191.5 GB
      torch.cuda.3.capability: 9.4 (see https://developer.nvidia.com/cuda-gpus#compute)
      torch.cuda.4.name: AMD Radeon Graphics
      torch.cuda.4.free: 191.0 GB
      torch.cuda.4.total: 191.5 GB
      torch.cuda.4.capability: 9.4 (see https://developer.nvidia.com/cuda-gpus#compute)
      torch.cuda.5.name: AMD Radeon Graphics
      torch.cuda.5.free: 191.0 GB
      torch.cuda.5.total: 191.5 GB
      torch.cuda.5.capability: 9.4 (see https://developer.nvidia.com/cuda-gpus#compute)
      torch.cuda.6.name: AMD Radeon Graphics
      torch.cuda.6.free: 191.0 GB
      torch.cuda.6.total: 191.5 GB
      torch.cuda.6.capability: 9.4 (see https://developer.nvidia.com/cuda-gpus#compute)
      torch.cuda.7.name: AMD Radeon Graphics
      torch.cuda.7.free: 191.0 GB
      torch.cuda.7.total: 191.5 GB
      torch.cuda.7.capability: 9.4 (see https://developer.nvidia.com/cuda-gpus#compute)

      llama_cpp_python:
      llama_cpp_python.version: 0.3.2
      llama_cpp_python.supports_gpu_offload: False

      Bug impact

      • End users cannot run data generation (`ilab data generate`) on AMD MI300X systems, because the vLLM teacher server fails to start.

      Known workaround

      • None provided.

      Additional context
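
      One detail from the log that may or may not be related: vLLM warns that 'MixtralForCausalLM' is only partially supported by ROCm and suggests setting VLLM_USE_TRITON_FLASH_ATTN=0 to use CK flash attention for sliding-window support. Whether that has any effect on the LoRA assertion above is untested; if it is worth ruling out, a sketch (assuming the exported variable is inherited by the vLLM subprocess that ilab spawns):

      [cloud-user@lab-mi300x cloud-user]$ export VLLM_USE_TRITON_FLASH_ATTN=0
      [cloud-user@lab-mi300x cloud-user]$ ilab data generate --enable-serving-output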

              Assignee: Unassigned
              Reporter: Dan McPherson (dmcphers@redhat.com)
              Votes: 0
              Watchers: 4