- Type: Bug
- Resolution: Duplicate
- Priority: Major
- Version: rhelai-1.4.3
To Reproduce
Steps to reproduce the behavior:
- Initialize with the MI300X x8 profile
- Run `ilab data generate` on the AMD MI300X x8 system (a command sketch follows this list)
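A minimal command sketch of the reproduction, assuming a fresh RHEL AI shell; the MI300X x8 system profile is selected interactively during init, and the model download step may need explicit repository arguments in this environment:
# initialize InstructLab and pick the AMD MI300X x8 profile when prompted
ilab config init
# fetch the teacher model and adapters used by the SDG pipeline, if not already cached
ilab model download
# run synthetic data generation with server output enabled, as in the failing run
ilab data generate --enable-serving-output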
Example output:
[cloud-user@lab-mi300x cloud-user]$ ilab data generate --enable-serving-output INFO 2025-03-14 02:36:50,453 instructlab.process.process:241: Started subprocess with PID 1. Logs are being written to /mnt/instructlab/.local/share/instructlab/logs/generation/generation-2ecc2dd6-007d-11f0-a6ce-6045bd01a29c.log. INFO 2025-03-14 02:36:51,010 instructlab.model.backends.vllm:115: Trying to connect to model server at http://127.0.0.1:8000/v1 INFO 2025-03-14 02:36:52,248 instructlab.model.backends.vllm:332: vLLM starting up on pid 5 at http://127.0.0.1:55565/v1 INFO 2025-03-14 02:36:52,248 instructlab.model.backends.vllm:123: Starting a temporary vLLM server at http://127.0.0.1:55565/v1 INFO 2025-03-14 02:36:52,248 instructlab.model.backends.vllm:138: Waiting for the vLLM server to start at http://127.0.0.1:55565/v1, this might take a moment... Attempt: 1/120 INFO 2025-03-14 02:36:55,545 instructlab.model.backends.vllm:138: Waiting for the vLLM server to start at http://127.0.0.1:55565/v1, this might take a moment... Attempt: 2/120 WARNING 03-14 02:36:57 rocm.py:34] `fork` method is not supported by ROCm. VLLM_WORKER_MULTIPROC_METHOD is overridden to `spawn` instead. /opt/app-root/lib64/python3.11/site-packages/vllm/connections.py:8: RuntimeWarning: Failed to read commit hash: No module named 'vllm._version' from vllm.version import __version__ as VLLM_VERSION INFO 03-14 02:36:58 api_server.py:643] vLLM API server version 0.6.4.post1 INFO 03-14 02:36:58 api_server.py:644] args: Namespace(host='127.0.0.1', port=55565, uvicorn_log_level='info', allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, lora_modules=[LoRAModulePath(name='skill-classifier-v3-clm', path='/mnt/instructlab/.cache/instructlab/models/skills-adapter-v3', base_model_name=None), LoRAModulePath(name='text-classifier-knowledge-v3-clm', path='/mnt/instructlab/.cache/instructlab/models/knowledge-adapter-v3', base_model_name=None)], prompt_adapters=None, chat_template=None, chat_template_content_format='auto', response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, ssl_cert_reqs=0, root_path=None, middleware=[], return_tokens_as_token_ids=False, disable_frontend_multiprocessing=False, enable_auto_tool_choice=False, tool_call_parser=None, tool_parser_plugin='', model='/mnt/instructlab/.cache/instructlab/models/mixtral-8x7b-instruct-v0-1', task='auto', tokenizer=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=False, allowed_local_media_path=None, download_dir=None, load_format='auto', config_format=<ConfigFormat.AUTO: 'auto'>, dtype='bfloat16', kv_cache_dtype='auto', max_model_len=None, guided_decoding_backend='xgrammar', logits_processor_pattern=None, distributed_executor_backend='mp', worker_use_ray=False, pipeline_parallel_size=1, tensor_parallel_size=1, max_parallel_loading_workers=None, ray_workers_use_nsight=False, block_size=16, enable_prefix_caching=True, disable_sliding_window=False, use_v2_block_manager=True, num_lookahead_slots=0, seed=0, swap_space=4, cpu_offload_gb=0, gpu_memory_utilization=0.9, num_gpu_blocks_override=None, max_num_batched_tokens=None, max_num_seqs=512, max_logprobs=20, disable_log_stats=False, quantization=None, rope_scaling=None, rope_theta=None, hf_overrides=None, enforce_eager=False, max_seq_len_to_capture=8192, disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, 
limit_mm_per_prompt=None, mm_processor_kwargs=None, mm_cache_preprocessor=False, enable_lora=True, enable_lora_bias=False, max_loras=1, max_lora_rank=64, lora_extra_vocab_size=256, lora_dtype='bfloat16', long_lora_scaling_factors=None, max_cpu_loras=None, fully_sharded_loras=True, enable_prompt_adapter=False, max_prompt_adapters=1, max_prompt_adapter_token=0, device='auto', num_scheduler_steps=1, multi_step_stream_outputs=True, scheduler_delay_factor=0.0, enable_chunked_prefill=None, speculative_model=None, speculative_model_quantization=None, num_speculative_tokens=None, speculative_disable_mqa_scorer=False, speculative_draft_tensor_parallel_size=None, speculative_max_model_len=None, speculative_disable_by_batch_size=None, ngram_prompt_lookup_max=None, ngram_prompt_lookup_min=None, spec_decoding_acceptance_method='rejection_sampler', typical_acceptance_sampler_posterior_threshold=None, typical_acceptance_sampler_posterior_alpha=None, disable_logprobs_during_spec_decoding=None, model_loader_extra_config=None, ignore_patterns=[], preemption_mode=None, served_model_name=None, qlora_adapter_name_or_path=None, otlp_traces_endpoint=None, collect_detailed_traces=None, disable_async_output_proc=False, scheduling_policy='fcfs', override_neuron_config=None, override_pooler_config=None, compilation_config=None, kv_transfer_config=None, worker_cls='auto', calculate_kv_scales=False, disable_log_requests=False, max_log_len=None, disable_fastapi_docs=False, enable_prompt_tokens_details=False) INFO 03-14 02:36:58 api_server.py:198] Started engine process with PID 25 INFO 2025-03-14 02:36:58,977 instructlab.model.backends.vllm:138: Waiting for the vLLM server to start at http://127.0.0.1:55565/v1, this might take a moment... Attempt: 3/120 /opt/app-root/lib64/python3.11/site-packages/vllm/connections.py:8: RuntimeWarning: Failed to read commit hash: No module named 'vllm._version' from vllm.version import __version__ as VLLM_VERSION INFO 2025-03-14 02:37:02,409 instructlab.model.backends.vllm:138: Waiting for the vLLM server to start at http://127.0.0.1:55565/v1, this might take a moment... Attempt: 4/120 INFO 03-14 02:37:03 config.py:444] This model supports multiple tasks: {'embed', 'score', 'reward', 'classify', 'generate'}. Defaulting to 'generate'. INFO 2025-03-14 02:37:05,682 instructlab.model.backends.vllm:138: Waiting for the vLLM server to start at http://127.0.0.1:55565/v1, this might take a moment... Attempt: 5/120 INFO 03-14 02:37:06 config.py:444] This model supports multiple tasks: {'reward', 'classify', 'generate', 'score', 'embed'}. Defaulting to 'generate'. 
INFO 03-14 02:37:07 llm_engine.py:249] Initializing an LLM engine (v0.6.4.post1) with config: model='/mnt/instructlab/.cache/instructlab/models/mixtral-8x7b-instruct-v0-1', speculative_config=None, tokenizer='/mnt/instructlab/.cache/instructlab/models/mixtral-8x7b-instruct-v0-1', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=32768, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='xgrammar'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=/mnt/instructlab/.cache/instructlab/models/mixtral-8x7b-instruct-v0-1, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=True, chunked_prefill_enabled=False, use_async_output_proc=True, mm_cache_preprocessor=False, mm_processor_kwargs=None, pooler_config=None, compilation_config={"candidate_compile_sizes":[],"compile_sizes":[],"capture_sizes":[512,504,496,488,480,472,464,456,448,440,432,424,416,408,400,392,384,376,368,360,352,344,336,328,320,312,304,296,288,280,272,264,256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"max_capture_size":512}, use_cached_outputs=True, WARNING 03-14 02:37:07 multiproc_worker_utils.py:312] Reducing Torch parallelism from 96 threads to 1 to avoid unnecessary CPU contention. Set OMP_NUM_THREADS in the external environment to tune this value as needed. INFO 03-14 02:37:08 selector.py:134] Using ROCmFlashAttention backend. INFO 03-14 02:37:08 model_runner.py:1090] Starting to load model /mnt/instructlab/.cache/instructlab/models/mixtral-8x7b-instruct-v0-1... WARNING 03-14 02:37:08 registry.py:315] Model architecture 'MixtralForCausalLM' is partially supported by ROCm: Sliding window attention (SWA) is not yet supported in Triton flash attention. For half-precision SWA support, please use CK flash attention by setting `VLLM_USE_TRITON_FLASH_ATTN=0` Loading safetensors checkpoint shards: 0% Completed | 0/19 [00:00<?, ?it/s] INFO 2025-03-14 02:37:09,101 instructlab.model.backends.vllm:138: Waiting for the vLLM server to start at http://127.0.0.1:55565/v1, this might take a moment... Attempt: 6/120 Loading safetensors checkpoint shards: 5% Completed | 1/19 [00:01<00:27, 1.52s/it] Loading safetensors checkpoint shards: 11% Completed | 2/19 [00:03<00:27, 1.60s/it] INFO 2025-03-14 02:37:12,316 instructlab.model.backends.vllm:138: Waiting for the vLLM server to start at http://127.0.0.1:55565/v1, this might take a moment... Attempt: 7/120 Loading safetensors checkpoint shards: 16% Completed | 3/19 [00:04<00:26, 1.63s/it] Loading safetensors checkpoint shards: 21% Completed | 4/19 [00:06<00:24, 1.64s/it] INFO 2025-03-14 02:37:15,695 instructlab.model.backends.vllm:138: Waiting for the vLLM server to start at http://127.0.0.1:55565/v1, this might take a moment... 
Attempt: 8/120 Loading safetensors checkpoint shards: 26% Completed | 5/19 [00:08<00:23, 1.66s/it] Loading safetensors checkpoint shards: 32% Completed | 6/19 [00:09<00:21, 1.68s/it] INFO 2025-03-14 02:37:19,140 instructlab.model.backends.vllm:138: Waiting for the vLLM server to start at http://127.0.0.1:55565/v1, this might take a moment... Attempt: 9/120 Loading safetensors checkpoint shards: 37% Completed | 7/19 [00:11<00:20, 1.68s/it] Loading safetensors checkpoint shards: 42% Completed | 8/19 [00:13<00:18, 1.68s/it] INFO 2025-03-14 02:37:22,486 instructlab.model.backends.vllm:138: Waiting for the vLLM server to start at http://127.0.0.1:55565/v1, this might take a moment... Attempt: 10/120 Loading safetensors checkpoint shards: 47% Completed | 9/19 [00:14<00:16, 1.61s/it] Loading safetensors checkpoint shards: 53% Completed | 10/19 [00:16<00:14, 1.63s/it] INFO 2025-03-14 02:37:25,671 instructlab.model.backends.vllm:138: Waiting for the vLLM server to start at http://127.0.0.1:55565/v1, this might take a moment... Attempt: 11/120 Loading safetensors checkpoint shards: 58% Completed | 11/19 [00:18<00:13, 1.66s/it] Loading safetensors checkpoint shards: 63% Completed | 12/19 [00:19<00:11, 1.66s/it] INFO 2025-03-14 02:37:28,891 instructlab.model.backends.vllm:138: Waiting for the vLLM server to start at http://127.0.0.1:55565/v1, this might take a moment... Attempt: 12/120 Loading safetensors checkpoint shards: 68% Completed | 13/19 [00:21<00:09, 1.67s/it] Loading safetensors checkpoint shards: 74% Completed | 14/19 [00:23<00:08, 1.66s/it] INFO 2025-03-14 02:37:32,264 instructlab.model.backends.vllm:138: Waiting for the vLLM server to start at http://127.0.0.1:55565/v1, this might take a moment... Attempt: 13/120 Loading safetensors checkpoint shards: 79% Completed | 15/19 [00:24<00:06, 1.68s/it] Loading safetensors checkpoint shards: 84% Completed | 16/19 [00:26<00:05, 1.67s/it] INFO 2025-03-14 02:37:35,710 instructlab.model.backends.vllm:138: Waiting for the vLLM server to start at http://127.0.0.1:55565/v1, this might take a moment... Attempt: 14/120 Loading safetensors checkpoint shards: 89% Completed | 17/19 [00:28<00:03, 1.68s/it] Loading safetensors checkpoint shards: 95% Completed | 18/19 [00:29<00:01, 1.68s/it] INFO 2025-03-14 02:37:38,883 instructlab.model.backends.vllm:138: Waiting for the vLLM server to start at http://127.0.0.1:55565/v1, this might take a moment... Attempt: 15/120 Loading safetensors checkpoint shards: 100% Completed | 19/19 [00:31<00:00, 1.69s/it] Loading safetensors checkpoint shards: 100% Completed | 19/19 [00:31<00:00, 1.66s/it] INFO 03-14 02:37:40 model_runner.py:1095] Loading model weights took 87.0026 GB INFO 03-14 02:37:40 punica_selector.py:11] Using PunicaWrapperGPU. 
ERROR 03-14 02:37:40 engine.py:366] Traceback (most recent call last): File "/opt/app-root/lib64/python3.11/site-packages/vllm/engine/multiprocessing/engine.py", line 357, in run_mp_engine engine = MQLLMEngine.from_engine_args(engine_args=engine_args, ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/opt/app-root/lib64/python3.11/site-packages/vllm/engine/multiprocessing/engine.py", line 119, in from_engine_args return cls(ipc_path=ipc_path, ^^^^^^^^^^^^^^^^^^^^^^ File "/opt/app-root/lib64/python3.11/site-packages/vllm/engine/multiprocessing/engine.py", line 71, in __init__ self.engine = LLMEngine(*args, **kwargs) ^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/opt/app-root/lib64/python3.11/site-packages/vllm/engine/llm_engine.py", line 288, in __init__ self.model_executor = executor_class(vllm_config=vllm_config, ) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/opt/app-root/lib64/python3.11/site-packages/vllm/executor/distributed_gpu_executor.py", line 26, in __init__ super().__init__(*args, **kwargs) File "/opt/app-root/lib64/python3.11/site-packages/vllm/executor/executor_base.py", line 36, in __init__ self._init_executor() File "/opt/app-root/lib64/python3.11/site-packages/vllm/executor/multiproc_gpu_executor.py", line 83, in _init_executor self._run_workers("load_model", File "/opt/app-root/lib64/python3.11/site-packages/vllm/executor/multiproc_gpu_executor.py", line 157, in _run_workers driver_worker_output = driver_worker_method(*args, **kwargs) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/opt/app-root/lib64/python3.11/site-packages/vllm/worker/worker.py", line 183, in load_model self.model_runner.load_model() File "/opt/app-root/lib64/python3.11/site-packages/vllm/worker/model_runner.py", line 1124, in load_model self.model = self.lora_manager.create_lora_manager(self.model) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/opt/app-root/lib64/python3.11/site-packages/vllm/lora/worker_manager.py", line 174, in create_lora_manager lora_manager = create_lora_manager( ^^^^^^^^^^^^^^^^^^^^ File "/opt/app-root/lib64/python3.11/site-packages/vllm/lora/models.py", line 755, in create_lora_manager lora_manager = lora_manager_cls( ^^^^^^^^^^^^^^^^^ File "/opt/app-root/lib64/python3.11/site-packages/vllm/lora/models.py", line 678, in __init__ super().__init__(model, max_num_seqs, max_num_batched_tokens, File "/opt/app-root/lib64/python3.11/site-packages/vllm/lora/models.py", line 353, in __init__ self._create_lora_modules() File "/opt/app-root/lib64/python3.11/site-packages/vllm/lora/models.py", line 507, in _create_lora_modules self.register_module(module_name, new_module) File "/opt/app-root/lib64/python3.11/site-packages/vllm/lora/models.py", line 513, in register_module assert isinstance(module, BaseLayerWithLoRA) AssertionError Process SpawnProcess-1: Traceback (most recent call last): File "/usr/lib64/python3.11/multiprocessing/process.py", line 314, in _bootstrap self.run() File "/usr/lib64/python3.11/multiprocessing/process.py", line 108, in run self._target(*self._args, **self._kwargs) File "/opt/app-root/lib64/python3.11/site-packages/vllm/engine/multiprocessing/engine.py", line 368, in run_mp_engine raise e File "/opt/app-root/lib64/python3.11/site-packages/vllm/engine/multiprocessing/engine.py", line 357, in run_mp_engine engine = MQLLMEngine.from_engine_args(engine_args=engine_args, ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/opt/app-root/lib64/python3.11/site-packages/vllm/engine/multiprocessing/engine.py", line 119, in from_engine_args return 
cls(ipc_path=ipc_path, ^^^^^^^^^^^^^^^^^^^^^^ File "/opt/app-root/lib64/python3.11/site-packages/vllm/engine/multiprocessing/engine.py", line 71, in __init__ self.engine = LLMEngine(*args, **kwargs) ^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/opt/app-root/lib64/python3.11/site-packages/vllm/engine/llm_engine.py", line 288, in __init__ self.model_executor = executor_class(vllm_config=vllm_config, ) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/opt/app-root/lib64/python3.11/site-packages/vllm/executor/distributed_gpu_executor.py", line 26, in __init__ super().__init__(*args, **kwargs) File "/opt/app-root/lib64/python3.11/site-packages/vllm/executor/executor_base.py", line 36, in __init__ self._init_executor() File "/opt/app-root/lib64/python3.11/site-packages/vllm/executor/multiproc_gpu_executor.py", line 83, in _init_executor self._run_workers("load_model", File "/opt/app-root/lib64/python3.11/site-packages/vllm/executor/multiproc_gpu_executor.py", line 157, in _run_workers driver_worker_output = driver_worker_method(*args, **kwargs) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/opt/app-root/lib64/python3.11/site-packages/vllm/worker/worker.py", line 183, in load_model self.model_runner.load_model() File "/opt/app-root/lib64/python3.11/site-packages/vllm/worker/model_runner.py", line 1124, in load_model self.model = self.lora_manager.create_lora_manager(self.model) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/opt/app-root/lib64/python3.11/site-packages/vllm/lora/worker_manager.py", line 174, in create_lora_manager lora_manager = create_lora_manager( ^^^^^^^^^^^^^^^^^^^^ File "/opt/app-root/lib64/python3.11/site-packages/vllm/lora/models.py", line 755, in create_lora_manager lora_manager = lora_manager_cls( ^^^^^^^^^^^^^^^^^ File "/opt/app-root/lib64/python3.11/site-packages/vllm/lora/models.py", line 678, in __init__ super().__init__(model, max_num_seqs, max_num_batched_tokens, File "/opt/app-root/lib64/python3.11/site-packages/vllm/lora/models.py", line 353, in __init__ self._create_lora_modules() File "/opt/app-root/lib64/python3.11/site-packages/vllm/lora/models.py", line 507, in _create_lora_modules self.register_module(module_name, new_module) File "/opt/app-root/lib64/python3.11/site-packages/vllm/lora/models.py", line 513, in register_module assert isinstance(module, BaseLayerWithLoRA) AssertionError INFO 2025-03-14 02:37:42,149 instructlab.model.backends.vllm:138: Waiting for the vLLM server to start at http://127.0.0.1:55565/v1, this might take a moment... 
Attempt: 16/120 Task exception was never retrieved future: <Task finished name='Task-2' coro=<MQLLMEngineClient.run_output_handler_loop() done, defined at /opt/app-root/lib64/python3.11/site-packages/vllm/engine/multiprocessing/client.py:178> exception=ZMQError('Operation not supported')> Traceback (most recent call last): File "/opt/app-root/lib64/python3.11/site-packages/vllm/engine/multiprocessing/client.py", line 184, in run_output_handler_loop while await self.output_socket.poll(timeout=VLLM_RPC_TIMEOUT ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/opt/app-root/lib64/python3.11/site-packages/zmq/_future.py", line 372, in poll raise _zmq.ZMQError(_zmq.ENOTSUP) zmq.error.ZMQError: Operation not supported Task exception was never retrieved future: <Task finished name='Task-3' coro=<MQLLMEngineClient.run_output_handler_loop() done, defined at /opt/app-root/lib64/python3.11/site-packages/vllm/engine/multiprocessing/client.py:178> exception=ZMQError('Operation not supported')> Traceback (most recent call last): File "/opt/app-root/lib64/python3.11/site-packages/vllm/engine/multiprocessing/client.py", line 184, in run_output_handler_loop while await self.output_socket.poll(timeout=VLLM_RPC_TIMEOUT ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/opt/app-root/lib64/python3.11/site-packages/zmq/_future.py", line 372, in poll raise _zmq.ZMQError(_zmq.ENOTSUP) zmq.error.ZMQError: Operation not supported Task exception was never retrieved future: <Task finished name='Task-4' coro=<MQLLMEngineClient.run_output_handler_loop() done, defined at /opt/app-root/lib64/python3.11/site-packages/vllm/engine/multiprocessing/client.py:178> exception=ZMQError('Operation not supported')> Traceback (most recent call last): File "/opt/app-root/lib64/python3.11/site-packages/vllm/engine/multiprocessing/client.py", line 184, in run_output_handler_loop while await self.output_socket.poll(timeout=VLLM_RPC_TIMEOUT ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/opt/app-root/lib64/python3.11/site-packages/zmq/_future.py", line 372, in poll raise _zmq.ZMQError(_zmq.ENOTSUP) zmq.error.ZMQError: Operation not supported Traceback (most recent call last): File "<frozen runpy>", line 198, in _run_module_as_main File "<frozen runpy>", line 88, in _run_code File "/opt/app-root/lib64/python3.11/site-packages/vllm/entrypoints/openai/api_server.py", line 701, in <module> uvloop.run(run_server(args)) File "/opt/app-root/lib64/python3.11/site-packages/uvloop/__init__.py", line 105, in run return runner.run(wrapper()) ^^^^^^^^^^^^^^^^^^^^^ File "/usr/lib64/python3.11/asyncio/runners.py", line 118, in run return self._loop.run_until_complete(task) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "uvloop/loop.pyx", line 1518, in uvloop.loop.Loop.run_until_complete File "/opt/app-root/lib64/python3.11/site-packages/uvloop/__init__.py", line 61, in wrapper return await main ^^^^^^^^^^ File "/opt/app-root/lib64/python3.11/site-packages/vllm/entrypoints/openai/api_server.py", line 667, in run_server async with build_async_engine_client(args) as engine_client: File "/usr/lib64/python3.11/contextlib.py", line 210, in __aenter__ return await anext(self.gen) ^^^^^^^^^^^^^^^^^^^^^ File "/opt/app-root/lib64/python3.11/site-packages/vllm/entrypoints/openai/api_server.py", line 117, in build_async_engine_client async with build_async_engine_client_from_engine_args( File "/usr/lib64/python3.11/contextlib.py", line 210, in __aenter__ return await anext(self.gen) ^^^^^^^^^^^^^^^^^^^^^ File 
"/opt/app-root/lib64/python3.11/site-packages/vllm/entrypoints/openai/api_server.py", line 222, in build_async_engine_client_from_engine_args raise RuntimeError( RuntimeError: Engine process failed to start. See stack trace for the root cause. INFO 2025-03-14 02:37:45,380 instructlab.model.backends.vllm:138: Waiting for the vLLM server to start at http://127.0.0.1:55565/v1, this might take a moment... Attempt: 17/120 INFO 2025-03-14 02:37:48,751 instructlab.model.backends.vllm:180: vLLM startup failed. Retrying (1/1) ERROR 2025-03-14 02:37:48,751 instructlab.model.backends.vllm:185: vLLM failed to start. INFO 2025-03-14 02:37:48,751 instructlab.model.backends.vllm:115: Trying to connect to model server at http://127.0.0.1:8000/v1 INFO 2025-03-14 02:37:50,130 instructlab.model.backends.vllm:332: vLLM starting up on pid 179 at http://127.0.0.1:60397/v1 INFO 2025-03-14 02:37:50,131 instructlab.model.backends.vllm:123: Starting a temporary vLLM server at http://127.0.0.1:60397/v1 INFO 2025-03-14 02:37:50,131 instructlab.model.backends.vllm:138: Waiting for the vLLM server to start at http://127.0.0.1:60397/v1, this might take a moment... Attempt: 1/120 WARNING 03-14 02:37:52 rocm.py:34] `fork` method is not supported by ROCm. VLLM_WORKER_MULTIPROC_METHOD is overridden to `spawn` instead. /opt/app-root/lib64/python3.11/site-packages/vllm/connections.py:8: RuntimeWarning: Failed to read commit hash: No module named 'vllm._version' from vllm.version import __version__ as VLLM_VERSION INFO 2025-03-14 02:37:53,334 instructlab.model.backends.vllm:138: Waiting for the vLLM server to start at http://127.0.0.1:60397/v1, this might take a moment... Attempt: 2/120 INFO 03-14 02:37:53 api_server.py:643] vLLM API server version 0.6.4.post1 INFO 03-14 02:37:53 api_server.py:644] args: Namespace(host='127.0.0.1', port=60397, uvicorn_log_level='info', allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, lora_modules=[LoRAModulePath(name='skill-classifier-v3-clm', path='/mnt/instructlab/.cache/instructlab/models/skills-adapter-v3', base_model_name=None), LoRAModulePath(name='text-classifier-knowledge-v3-clm', path='/mnt/instructlab/.cache/instructlab/models/knowledge-adapter-v3', base_model_name=None)], prompt_adapters=None, chat_template=None, chat_template_content_format='auto', response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, ssl_cert_reqs=0, root_path=None, middleware=[], return_tokens_as_token_ids=False, disable_frontend_multiprocessing=False, enable_auto_tool_choice=False, tool_call_parser=None, tool_parser_plugin='', model='/mnt/instructlab/.cache/instructlab/models/mixtral-8x7b-instruct-v0-1', task='auto', tokenizer=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=False, allowed_local_media_path=None, download_dir=None, load_format='auto', config_format=<ConfigFormat.AUTO: 'auto'>, dtype='bfloat16', kv_cache_dtype='auto', max_model_len=None, guided_decoding_backend='xgrammar', logits_processor_pattern=None, distributed_executor_backend='mp', worker_use_ray=False, pipeline_parallel_size=1, tensor_parallel_size=1, max_parallel_loading_workers=None, ray_workers_use_nsight=False, block_size=16, enable_prefix_caching=True, disable_sliding_window=False, use_v2_block_manager=True, num_lookahead_slots=0, seed=0, swap_space=4, cpu_offload_gb=0, gpu_memory_utilization=0.9, num_gpu_blocks_override=None, max_num_batched_tokens=None, 
max_num_seqs=512, max_logprobs=20, disable_log_stats=False, quantization=None, rope_scaling=None, rope_theta=None, hf_overrides=None, enforce_eager=False, max_seq_len_to_capture=8192, disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, limit_mm_per_prompt=None, mm_processor_kwargs=None, mm_cache_preprocessor=False, enable_lora=True, enable_lora_bias=False, max_loras=1, max_lora_rank=64, lora_extra_vocab_size=256, lora_dtype='bfloat16', long_lora_scaling_factors=None, max_cpu_loras=None, fully_sharded_loras=True, enable_prompt_adapter=False, max_prompt_adapters=1, max_prompt_adapter_token=0, device='auto', num_scheduler_steps=1, multi_step_stream_outputs=True, scheduler_delay_factor=0.0, enable_chunked_prefill=None, speculative_model=None, speculative_model_quantization=None, num_speculative_tokens=None, speculative_disable_mqa_scorer=False, speculative_draft_tensor_parallel_size=None, speculative_max_model_len=None, speculative_disable_by_batch_size=None, ngram_prompt_lookup_max=None, ngram_prompt_lookup_min=None, spec_decoding_acceptance_method='rejection_sampler', typical_acceptance_sampler_posterior_threshold=None, typical_acceptance_sampler_posterior_alpha=None, disable_logprobs_during_spec_decoding=None, model_loader_extra_config=None, ignore_patterns=[], preemption_mode=None, served_model_name=None, qlora_adapter_name_or_path=None, otlp_traces_endpoint=None, collect_detailed_traces=None, disable_async_output_proc=False, scheduling_policy='fcfs', override_neuron_config=None, override_pooler_config=None, compilation_config=None, kv_transfer_config=None, worker_cls='auto', calculate_kv_scales=False, disable_log_requests=False, max_log_len=None, disable_fastapi_docs=False, enable_prompt_tokens_details=False) INFO 03-14 02:37:53 api_server.py:198] Started engine process with PID 199 /opt/app-root/lib64/python3.11/site-packages/vllm/connections.py:8: RuntimeWarning: Failed to read commit hash: No module named 'vllm._version' from vllm.version import __version__ as VLLM_VERSION INFO 2025-03-14 02:37:56,792 instructlab.model.backends.vllm:138: Waiting for the vLLM server to start at http://127.0.0.1:60397/v1, this might take a moment... Attempt: 3/120 INFO 03-14 02:37:58 config.py:444] This model supports multiple tasks: {'score', 'classify', 'generate', 'reward', 'embed'}. Defaulting to 'generate'. INFO 2025-03-14 02:38:00,171 instructlab.model.backends.vllm:138: Waiting for the vLLM server to start at http://127.0.0.1:60397/v1, this might take a moment... Attempt: 4/120 INFO 03-14 02:38:01 config.py:444] This model supports multiple tasks: {'score', 'reward', 'embed', 'classify', 'generate'}. Defaulting to 'generate'. 
INFO 03-14 02:38:03 llm_engine.py:249] Initializing an LLM engine (v0.6.4.post1) with config: model='/mnt/instructlab/.cache/instructlab/models/mixtral-8x7b-instruct-v0-1', speculative_config=None, tokenizer='/mnt/instructlab/.cache/instructlab/models/mixtral-8x7b-instruct-v0-1', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=32768, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='xgrammar'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=/mnt/instructlab/.cache/instructlab/models/mixtral-8x7b-instruct-v0-1, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=True, chunked_prefill_enabled=False, use_async_output_proc=True, mm_cache_preprocessor=False, mm_processor_kwargs=None, pooler_config=None, compilation_config={"candidate_compile_sizes":[],"compile_sizes":[],"capture_sizes":[512,504,496,488,480,472,464,456,448,440,432,424,416,408,400,392,384,376,368,360,352,344,336,328,320,312,304,296,288,280,272,264,256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"max_capture_size":512}, use_cached_outputs=True, WARNING 03-14 02:38:03 multiproc_worker_utils.py:312] Reducing Torch parallelism from 96 threads to 1 to avoid unnecessary CPU contention. Set OMP_NUM_THREADS in the external environment to tune this value as needed. INFO 03-14 02:38:03 selector.py:134] Using ROCmFlashAttention backend. INFO 2025-03-14 02:38:03,499 instructlab.model.backends.vllm:138: Waiting for the vLLM server to start at http://127.0.0.1:60397/v1, this might take a moment... Attempt: 5/120 INFO 03-14 02:38:03 model_runner.py:1090] Starting to load model /mnt/instructlab/.cache/instructlab/models/mixtral-8x7b-instruct-v0-1... WARNING 03-14 02:38:03 registry.py:315] Model architecture 'MixtralForCausalLM' is partially supported by ROCm: Sliding window attention (SWA) is not yet supported in Triton flash attention. For half-precision SWA support, please use CK flash attention by setting `VLLM_USE_TRITON_FLASH_ATTN=0` Loading safetensors checkpoint shards: 0% Completed | 0/19 [00:00<?, ?it/s] Loading safetensors checkpoint shards: 5% Completed | 1/19 [00:01<00:24, 1.34s/it] Loading safetensors checkpoint shards: 11% Completed | 2/19 [00:02<00:23, 1.41s/it] INFO 2025-03-14 02:38:06,664 instructlab.model.backends.vllm:138: Waiting for the vLLM server to start at http://127.0.0.1:60397/v1, this might take a moment... Attempt: 6/120 Loading safetensors checkpoint shards: 16% Completed | 3/19 [00:04<00:22, 1.43s/it] Loading safetensors checkpoint shards: 21% Completed | 4/19 [00:05<00:22, 1.47s/it] INFO 2025-03-14 02:38:09,813 instructlab.model.backends.vllm:138: Waiting for the vLLM server to start at http://127.0.0.1:60397/v1, this might take a moment... 
Attempt: 7/120 Loading safetensors checkpoint shards: 26% Completed | 5/19 [00:07<00:20, 1.49s/it] Loading safetensors checkpoint shards: 32% Completed | 6/19 [00:08<00:19, 1.50s/it] INFO 2025-03-14 02:38:13,108 instructlab.model.backends.vllm:138: Waiting for the vLLM server to start at http://127.0.0.1:60397/v1, this might take a moment... Attempt: 8/120 Loading safetensors checkpoint shards: 37% Completed | 7/19 [00:10<00:18, 1.51s/it] Loading safetensors checkpoint shards: 42% Completed | 8/19 [00:11<00:16, 1.50s/it] INFO 2025-03-14 02:38:16,440 instructlab.model.backends.vllm:138: Waiting for the vLLM server to start at http://127.0.0.1:60397/v1, this might take a moment... Attempt: 9/120 Loading safetensors checkpoint shards: 47% Completed | 9/19 [00:13<00:14, 1.44s/it] Loading safetensors checkpoint shards: 53% Completed | 10/19 [00:14<00:13, 1.47s/it] INFO 2025-03-14 02:38:19,645 instructlab.model.backends.vllm:138: Waiting for the vLLM server to start at http://127.0.0.1:60397/v1, this might take a moment... Attempt: 10/120 Loading safetensors checkpoint shards: 58% Completed | 11/19 [00:16<00:11, 1.49s/it] Loading safetensors checkpoint shards: 63% Completed | 12/19 [00:17<00:10, 1.50s/it] INFO 2025-03-14 02:38:22,838 instructlab.model.backends.vllm:138: Waiting for the vLLM server to start at http://127.0.0.1:60397/v1, this might take a moment... Attempt: 11/120 Loading safetensors checkpoint shards: 68% Completed | 13/19 [00:19<00:09, 1.50s/it] Loading safetensors checkpoint shards: 74% Completed | 14/19 [00:20<00:07, 1.51s/it] Loading safetensors checkpoint shards: 79% Completed | 15/19 [00:22<00:06, 1.50s/it] INFO 2025-03-14 02:38:26,323 instructlab.model.backends.vllm:138: Waiting for the vLLM server to start at http://127.0.0.1:60397/v1, this might take a moment... Attempt: 12/120 Loading safetensors checkpoint shards: 84% Completed | 16/19 [00:23<00:04, 1.50s/it] Loading safetensors checkpoint shards: 89% Completed | 17/19 [00:25<00:02, 1.50s/it] INFO 2025-03-14 02:38:29,657 instructlab.model.backends.vllm:138: Waiting for the vLLM server to start at http://127.0.0.1:60397/v1, this might take a moment... Attempt: 13/120 Loading safetensors checkpoint shards: 95% Completed | 18/19 [00:26<00:01, 1.50s/it] Loading safetensors checkpoint shards: 100% Completed | 19/19 [00:28<00:00, 1.52s/it] Loading safetensors checkpoint shards: 100% Completed | 19/19 [00:28<00:00, 1.49s/it] INFO 03-14 02:38:32 model_runner.py:1095] Loading model weights took 87.0026 GB INFO 03-14 02:38:32 punica_selector.py:11] Using PunicaWrapperGPU. 
ERROR 03-14 02:38:32 engine.py:366] Traceback (most recent call last): File "/opt/app-root/lib64/python3.11/site-packages/vllm/engine/multiprocessing/engine.py", line 357, in run_mp_engine engine = MQLLMEngine.from_engine_args(engine_args=engine_args, ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/opt/app-root/lib64/python3.11/site-packages/vllm/engine/multiprocessing/engine.py", line 119, in from_engine_args return cls(ipc_path=ipc_path, ^^^^^^^^^^^^^^^^^^^^^^ File "/opt/app-root/lib64/python3.11/site-packages/vllm/engine/multiprocessing/engine.py", line 71, in __init__ self.engine = LLMEngine(*args, **kwargs) ^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/opt/app-root/lib64/python3.11/site-packages/vllm/engine/llm_engine.py", line 288, in __init__ self.model_executor = executor_class(vllm_config=vllm_config, ) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/opt/app-root/lib64/python3.11/site-packages/vllm/executor/distributed_gpu_executor.py", line 26, in __init__ super().__init__(*args, **kwargs) File "/opt/app-root/lib64/python3.11/site-packages/vllm/executor/executor_base.py", line 36, in __init__ self._init_executor() File "/opt/app-root/lib64/python3.11/site-packages/vllm/executor/multiproc_gpu_executor.py", line 83, in _init_executor self._run_workers("load_model", File "/opt/app-root/lib64/python3.11/site-packages/vllm/executor/multiproc_gpu_executor.py", line 157, in _run_workers driver_worker_output = driver_worker_method(*args, **kwargs) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/opt/app-root/lib64/python3.11/site-packages/vllm/worker/worker.py", line 183, in load_model self.model_runner.load_model() File "/opt/app-root/lib64/python3.11/site-packages/vllm/worker/model_runner.py", line 1124, in load_model self.model = self.lora_manager.create_lora_manager(self.model) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/opt/app-root/lib64/python3.11/site-packages/vllm/lora/worker_manager.py", line 174, in create_lora_manager lora_manager = create_lora_manager( ^^^^^^^^^^^^^^^^^^^^ File "/opt/app-root/lib64/python3.11/site-packages/vllm/lora/models.py", line 755, in create_lora_manager lora_manager = lora_manager_cls( ^^^^^^^^^^^^^^^^^ File "/opt/app-root/lib64/python3.11/site-packages/vllm/lora/models.py", line 678, in __init__ super().__init__(model, max_num_seqs, max_num_batched_tokens, File "/opt/app-root/lib64/python3.11/site-packages/vllm/lora/models.py", line 353, in __init__ self._create_lora_modules() File "/opt/app-root/lib64/python3.11/site-packages/vllm/lora/models.py", line 507, in _create_lora_modules self.register_module(module_name, new_module) File "/opt/app-root/lib64/python3.11/site-packages/vllm/lora/models.py", line 513, in register_module assert isinstance(module, BaseLayerWithLoRA) AssertionError Process SpawnProcess-1: Traceback (most recent call last): File "/usr/lib64/python3.11/multiprocessing/process.py", line 314, in _bootstrap self.run() File "/usr/lib64/python3.11/multiprocessing/process.py", line 108, in run self._target(*self._args, **self._kwargs) File "/opt/app-root/lib64/python3.11/site-packages/vllm/engine/multiprocessing/engine.py", line 368, in run_mp_engine raise e File "/opt/app-root/lib64/python3.11/site-packages/vllm/engine/multiprocessing/engine.py", line 357, in run_mp_engine engine = MQLLMEngine.from_engine_args(engine_args=engine_args, ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/opt/app-root/lib64/python3.11/site-packages/vllm/engine/multiprocessing/engine.py", line 119, in from_engine_args return 
cls(ipc_path=ipc_path, ^^^^^^^^^^^^^^^^^^^^^^ File "/opt/app-root/lib64/python3.11/site-packages/vllm/engine/multiprocessing/engine.py", line 71, in __init__ self.engine = LLMEngine(*args, **kwargs) ^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/opt/app-root/lib64/python3.11/site-packages/vllm/engine/llm_engine.py", line 288, in __init__ self.model_executor = executor_class(vllm_config=vllm_config, ) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/opt/app-root/lib64/python3.11/site-packages/vllm/executor/distributed_gpu_executor.py", line 26, in __init__ super().__init__(*args, **kwargs) File "/opt/app-root/lib64/python3.11/site-packages/vllm/executor/executor_base.py", line 36, in __init__ self._init_executor() File "/opt/app-root/lib64/python3.11/site-packages/vllm/executor/multiproc_gpu_executor.py", line 83, in _init_executor self._run_workers("load_model", File "/opt/app-root/lib64/python3.11/site-packages/vllm/executor/multiproc_gpu_executor.py", line 157, in _run_workers driver_worker_output = driver_worker_method(*args, **kwargs) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/opt/app-root/lib64/python3.11/site-packages/vllm/worker/worker.py", line 183, in load_model self.model_runner.load_model() File "/opt/app-root/lib64/python3.11/site-packages/vllm/worker/model_runner.py", line 1124, in load_model self.model = self.lora_manager.create_lora_manager(self.model) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/opt/app-root/lib64/python3.11/site-packages/vllm/lora/worker_manager.py", line 174, in create_lora_manager lora_manager = create_lora_manager( ^^^^^^^^^^^^^^^^^^^^ File "/opt/app-root/lib64/python3.11/site-packages/vllm/lora/models.py", line 755, in create_lora_manager lora_manager = lora_manager_cls( ^^^^^^^^^^^^^^^^^ File "/opt/app-root/lib64/python3.11/site-packages/vllm/lora/models.py", line 678, in __init__ super().__init__(model, max_num_seqs, max_num_batched_tokens, File "/opt/app-root/lib64/python3.11/site-packages/vllm/lora/models.py", line 353, in __init__ self._create_lora_modules() File "/opt/app-root/lib64/python3.11/site-packages/vllm/lora/models.py", line 507, in _create_lora_modules self.register_module(module_name, new_module) File "/opt/app-root/lib64/python3.11/site-packages/vllm/lora/models.py", line 513, in register_module assert isinstance(module, BaseLayerWithLoRA) AssertionError INFO 2025-03-14 02:38:32,800 instructlab.model.backends.vllm:138: Waiting for the vLLM server to start at http://127.0.0.1:60397/v1, this might take a moment... Attempt: 14/120 INFO 2025-03-14 02:38:36,098 instructlab.model.backends.vllm:138: Waiting for the vLLM server to start at http://127.0.0.1:60397/v1, this might take a moment... Attempt: 15/120 INFO 2025-03-14 02:38:39,498 instructlab.model.backends.vllm:138: Waiting for the vLLM server to start at http://127.0.0.1:60397/v1, this might take a moment... 
Attempt: 16/120 Task exception was never retrieved future: <Task finished name='Task-2' coro=<MQLLMEngineClient.run_output_handler_loop() done, defined at /opt/app-root/lib64/python3.11/site-packages/vllm/engine/multiprocessing/client.py:178> exception=ZMQError('Operation not supported')> Traceback (most recent call last): File "/opt/app-root/lib64/python3.11/site-packages/vllm/engine/multiprocessing/client.py", line 184, in run_output_handler_loop while await self.output_socket.poll(timeout=VLLM_RPC_TIMEOUT ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/opt/app-root/lib64/python3.11/site-packages/zmq/_future.py", line 372, in poll raise _zmq.ZMQError(_zmq.ENOTSUP) zmq.error.ZMQError: Operation not supported Task exception was never retrieved future: <Task finished name='Task-3' coro=<MQLLMEngineClient.run_output_handler_loop() done, defined at /opt/app-root/lib64/python3.11/site-packages/vllm/engine/multiprocessing/client.py:178> exception=ZMQError('Operation not supported')> Traceback (most recent call last): File "/opt/app-root/lib64/python3.11/site-packages/vllm/engine/multiprocessing/client.py", line 184, in run_output_handler_loop while await self.output_socket.poll(timeout=VLLM_RPC_TIMEOUT ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/opt/app-root/lib64/python3.11/site-packages/zmq/_future.py", line 372, in poll raise _zmq.ZMQError(_zmq.ENOTSUP) zmq.error.ZMQError: Operation not supported Task exception was never retrieved future: <Task finished name='Task-4' coro=<MQLLMEngineClient.run_output_handler_loop() done, defined at /opt/app-root/lib64/python3.11/site-packages/vllm/engine/multiprocessing/client.py:178> exception=ZMQError('Operation not supported')> Traceback (most recent call last): File "/opt/app-root/lib64/python3.11/site-packages/vllm/engine/multiprocessing/client.py", line 184, in run_output_handler_loop while await self.output_socket.poll(timeout=VLLM_RPC_TIMEOUT ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/opt/app-root/lib64/python3.11/site-packages/zmq/_future.py", line 372, in poll raise _zmq.ZMQError(_zmq.ENOTSUP) zmq.error.ZMQError: Operation not supported Traceback (most recent call last): File "<frozen runpy>", line 198, in _run_module_as_main File "<frozen runpy>", line 88, in _run_code File "/opt/app-root/lib64/python3.11/site-packages/vllm/entrypoints/openai/api_server.py", line 701, in <module> uvloop.run(run_server(args)) File "/opt/app-root/lib64/python3.11/site-packages/uvloop/__init__.py", line 105, in run return runner.run(wrapper()) ^^^^^^^^^^^^^^^^^^^^^ File "/usr/lib64/python3.11/asyncio/runners.py", line 118, in run return self._loop.run_until_complete(task) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "uvloop/loop.pyx", line 1518, in uvloop.loop.Loop.run_until_complete File "/opt/app-root/lib64/python3.11/site-packages/uvloop/__init__.py", line 61, in wrapper return await main ^^^^^^^^^^ File "/opt/app-root/lib64/python3.11/site-packages/vllm/entrypoints/openai/api_server.py", line 667, in run_server async with build_async_engine_client(args) as engine_client: File "/usr/lib64/python3.11/contextlib.py", line 210, in __aenter__ return await anext(self.gen) ^^^^^^^^^^^^^^^^^^^^^ File "/opt/app-root/lib64/python3.11/site-packages/vllm/entrypoints/openai/api_server.py", line 117, in build_async_engine_client async with build_async_engine_client_from_engine_args( File "/usr/lib64/python3.11/contextlib.py", line 210, in __aenter__ return await anext(self.gen) ^^^^^^^^^^^^^^^^^^^^^ File 
"/opt/app-root/lib64/python3.11/site-packages/vllm/entrypoints/openai/api_server.py", line 222, in build_async_engine_client_from_engine_args raise RuntimeError( RuntimeError: Engine process failed to start. See stack trace for the root cause. failed to generate data with exception: Failed to start server: vLLM failed to start.
Expected behavior
- `ilab data generate` should run to completion and produce synthetic data.
Screenshots
- Attached Image
Device Info (please complete the following information):
- Hardware Specs: AMD MI300X x8
- OS Version: Red Hat Enterprise Linux 9.4 (Plow)
- InstructLab Version: output of `ilab --version`:
[cloud-user@lab-mi300x cloud-user]$ ilab --version
ilab, version 0.23.3
- Provide the output of these two commands:
- `sudo bootc status --format json | jq .status.booted.image.image.image` to print the name and tag of the bootc image; it should look like registry.stage.redhat.io/rhelai1/bootc-intel-rhel9:1.3-1732894187:
"registry.stage.redhat.io/rhelai1/bootc-azure-amd-rhel9:1.4"
- `ilab system info` to print detailed information about InstructLab version, OS, and hardware, including GPU / AI accelerator hardware:
Platform:
sys.version: 3.11.7 (main, Jan 8 2025, 00:00:00) [GCC 11.4.1 20231218 (Red Hat 11.4.1-3)]
sys.platform: linux
os.name: posix
platform.release: 5.14.0-427.55.1.el9_4.x86_64
platform.machine: x86_64
platform.node: lab-mi300x
platform.python_version: 3.11.7
os-release.ID: rhel
os-release.VERSION_ID: 9.4
os-release.PRETTY_NAME: Red Hat Enterprise Linux 9.4 (Plow)
memory.total: 1820.96 GB
memory.available: 1782.84 GB
memory.used: 29.43 GB
InstructLab:
instructlab.version: 0.23.3
instructlab-dolomite.version: 0.2.0
instructlab-eval.version: 0.5.1
instructlab-quantize.version: 0.1.0
instructlab-schema.version: 0.4.2
instructlab-sdg.version: 0.7.1
instructlab-training.version: 0.7.0
Torch:
torch.version: 2.4.1
torch.backends.cpu.capability: AVX512
torch.version.cuda: None
torch.version.hip: 6.2.41134-65d174c3e
torch.cuda.available: True
torch.backends.cuda.is_built: True
torch.backends.mps.is_built: False
torch.backends.mps.is_available: False
torch.cuda.bf16: True
torch.cuda.current.device: 0
torch.cuda.0.name: AMD Radeon Graphics
torch.cuda.0.free: 191.0 GB
torch.cuda.0.total: 191.5 GB
torch.cuda.0.capability: 9.4 (see https://developer.nvidia.com/cuda-gpus#compute)
torch.cuda.1.name: AMD Radeon Graphics
torch.cuda.1.free: 191.0 GB
torch.cuda.1.total: 191.5 GB
torch.cuda.1.capability: 9.4 (see https://developer.nvidia.com/cuda-gpus#compute)
torch.cuda.2.name: AMD Radeon Graphics
torch.cuda.2.free: 191.0 GB
torch.cuda.2.total: 191.5 GB
torch.cuda.2.capability: 9.4 (see https://developer.nvidia.com/cuda-gpus#compute)
torch.cuda.3.name: AMD Radeon Graphics
torch.cuda.3.free: 191.0 GB
torch.cuda.3.total: 191.5 GB
torch.cuda.3.capability: 9.4 (see https://developer.nvidia.com/cuda-gpus#compute)
torch.cuda.4.name: AMD Radeon Graphics
torch.cuda.4.free: 191.0 GB
torch.cuda.4.total: 191.5 GB
torch.cuda.4.capability: 9.4 (see https://developer.nvidia.com/cuda-gpus#compute)
torch.cuda.5.name: AMD Radeon Graphics
torch.cuda.5.free: 191.0 GB
torch.cuda.5.total: 191.5 GB
torch.cuda.5.capability: 9.4 (see https://developer.nvidia.com/cuda-gpus#compute)
torch.cuda.6.name: AMD Radeon Graphics
torch.cuda.6.free: 191.0 GB
torch.cuda.6.total: 191.5 GB
torch.cuda.6.capability: 9.4 (see https://developer.nvidia.com/cuda-gpus#compute)
torch.cuda.7.name: AMD Radeon Graphics
torch.cuda.7.free: 191.0 GB
torch.cuda.7.total: 191.5 GB
torch.cuda.7.capability: 9.4 (see https://developer.nvidia.com/cuda-gpus#compute)
llama_cpp_python:
llama_cpp_python.version: 0.3.2
llama_cpp_python.supports_gpu_offload: False
Bug impact
- Synthetic data generation (`ilab data generate`) cannot run on AMD MI300X systems; SDG is blocked on this hardware.
Known workaround
- None provided.
Additional context
- Full logs from the run: https://gitlab.com/redhat/rhel-ai/diip/-/jobs/9408738354
- Duplicates: AIPCC-726 "RHEL AI 1.4.3 - vllm fails to start on AMD for SDG" (Closed; Ready to be tested)