
Type: Bug
Resolution: Done
Priority: Critical
Fix Version/s: RHELAI 1.3 GA
Affects Version/s: None
Component/s: Accelerators - AMD
Labels: None

Blocked: False
Blocked Reason: None
Ready: False

Release Blocker: Approved


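A CUDA kernel failure reported by tflink (log below) indicates a possible GPU architecture problem:

Error code 98 == invalid device function (hipErrorInvalidDeviceFunction on ROCm). In the log, every failing worker raises it from torch.ops._rocm_C.wvSpltK, reached via tuned_gemm.py (apply_skinny) while capturing CUDA graphs during initialize_cache.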
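For triage, a minimal sketch of how the error code in the log could be decoded and how a device architecture could be compared against the targets a build was compiled for. The helper names (`decode_kernel_failure`, `base_arch`, `arch_is_compiled`) are hypothetical, the error table is a small subset, and the architecture strings in the usage note are illustrative only, since this report does not state which GPU or build targets were involved.

```python
import re

# Small subset of HIP runtime error codes; extend as needed.
HIP_ERRORS = {
    98: "hipErrorInvalidDeviceFunction (invalid device function)",
    209: "hipErrorNoBinaryForGpu (no kernel image for this GPU)",
}

# Matches the message format seen in the vLLM log in this report.
_KERNEL_FAILED = re.compile(r"CUDA kernel failed : (\d+)")


def decode_kernel_failure(log_line):
    """Return a readable name for the error code in a log line, or None."""
    match = _KERNEL_FAILED.search(log_line)
    if match is None:
        return None
    code = int(match.group(1))
    return HIP_ERRORS.get(code, f"unknown HIP error {code}")


def base_arch(arch):
    """Strip feature flags, e.g. 'gfx90a:sramecc+:xnack-' becomes 'gfx90a'."""
    return arch.split(":", 1)[0].strip()


def arch_is_compiled(device_arch, compiled_archs):
    """True if the device's base architecture is among the compiled targets."""
    targets = {base_arch(a) for a in compiled_archs}
    return base_arch(device_arch) in targets


if __name__ == "__main__":
    line = "RuntimeError: CUDA kernel failed : 98"
    print(decode_kernel_failure(line))
    # Hypothetical values; on a real system they would come from the
    # running GPU and from the build configuration of the extension.
    print(arch_is_compiled("gfx942:sramecc+:xnack-", ["gfx90a"]))
```

On a ROCm build of PyTorch, the device string can likely be read from `torch.cuda.get_device_properties(0).gcnArchName` and PyTorch's own targets from `torch.cuda.get_arch_list()`; treat both as assumptions to verify on the affected image. Note that the custom `_rocm_C` extension is built separately from PyTorch, so its target list may differ.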

(app-root) /$ export VLLM_USE_TRITON_FLASH_ATTN=0
(app-root) /$ ilab model serve
INFO 2024-11-27 02:55:22,827 instructlab.model.serve_backend:56: Using model '/opt/app-root/src/.cache/instructlab/models/granite-8b-lab-v1' with -1 gpu-layers and 4096 max context size.
INFO 2024-11-27 02:55:22,829 instructlab.model.backends.vllm:313: vLLM starting up on pid 1515 at http://127.0.0.1:8000/v1
WARNING 11-27 02:55:24 rocm.py:17] `fork` method is not supported by ROCm. VLLM_WORKER_MULTIPROC_METHOD is overridden to `spawn` instead.
/opt/app-root/lib64/python3.11/site-packages/vllm/connections.py:8: RuntimeWarning: Failed to read commit hash:
No module named 'vllm._version'
  from vllm.version import __version__ as VLLM_VERSION
INFO 11-27 02:55:26 api_server.py:526] vLLM API server version 0.6.2
INFO 11-27 02:55:26 api_server.py:527] args: Namespace(host='127.0.0.1', port=8000, uvicorn_log_level='info', allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, lora_modules=None, prompt_adapters=None, chat_template='/tmp/tmp516a2rel', response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, ssl_cert_reqs=0, root_path=None, middleware=[], return_tokens_as_token_ids=False, disable_frontend_multiprocessing=False, enable_auto_tool_choice=False, tool_call_parser=None, model='/opt/app-root/src/.cache/instructlab/models/granite-8b-lab-v1', tokenizer=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=False, download_dir=None, load_format='auto', config_format='auto', dtype='auto', kv_cache_dtype='auto', quantization_param_path=None, max_model_len=None, guided_decoding_backend='outlines', distributed_executor_backend='mp', worker_use_ray=False, pipeline_parallel_size=1, tensor_parallel_size=4, max_parallel_loading_workers=None, ray_workers_use_nsight=False, block_size=16, enable_prefix_caching=False, disable_sliding_window=False, use_v2_block_manager=False, num_lookahead_slots=0, seed=0, swap_space=4, cpu_offload_gb=0, gpu_memory_utilization=0.9, num_gpu_blocks_override=None, max_num_batched_tokens=None, max_num_seqs=256, max_logprobs=20, disable_log_stats=False, quantization=None, rope_scaling=None, rope_theta=None, enforce_eager=False, max_context_len_to_capture=None, max_seq_len_to_capture=8192, disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, limit_mm_per_prompt=None, mm_processor_kwargs=None, enable_lora=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', long_lora_scaling_factors=None, max_cpu_loras=None, fully_sharded_loras=False, enable_prompt_adapter=False, max_prompt_adapters=1, 
max_prompt_adapter_token=0, device='auto', num_scheduler_steps=1, multi_step_stream_outputs=False, scheduler_delay_factor=0.0, enable_chunked_prefill=None, speculative_model=None, speculative_model_quantization=None, num_speculative_tokens=None, speculative_draft_tensor_parallel_size=None, speculative_max_model_len=None, speculative_disable_by_batch_size=None, ngram_prompt_lookup_max=None, ngram_prompt_lookup_min=None, spec_decoding_acceptance_method='rejection_sampler', typical_acceptance_sampler_posterior_threshold=None, typical_acceptance_sampler_posterior_alpha=None, disable_logprobs_during_spec_decoding=None, model_loader_extra_config=None, ignore_patterns=[], preemption_mode=None, served_model_name=None, qlora_adapter_name_or_path=None, otlp_traces_endpoint=None, collect_detailed_traces=None, disable_async_output_proc=False, override_neuron_config=None, disable_log_requests=False, max_log_len=None, disable_fastapi_docs=False)
INFO 11-27 02:55:26 api_server.py:164] Multiprocessing frontend to use ipc:///tmp/ce023fd5-ec71-4440-9c37-f21252811f3a for IPC Path.
INFO 11-27 02:55:26 api_server.py:177] Started engine process with PID 1538
INFO 11-27 02:55:26 config.py:1659] Downcasting torch.float32 to torch.float16.
/opt/app-root/lib64/python3.11/site-packages/vllm/connections.py:8: RuntimeWarning: Failed to read commit hash:
No module named 'vllm._version'
  from vllm.version import __version__ as VLLM_VERSION
INFO 11-27 02:55:29 config.py:1659] Downcasting torch.float32 to torch.float16.
INFO 11-27 02:55:29 llm_engine.py:226] Initializing an LLM engine (v0.6.2) with config: model='/opt/app-root/src/.cache/instructlab/models/granite-8b-lab-v1', speculative_config=None, tokenizer='/opt/app-root/src/.cache/instructlab/models/granite-8b-lab-v1', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=4096, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=4, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=/opt/app-root/src/.cache/instructlab/models/granite-8b-lab-v1, use_v2_block_manager=False, num_scheduler_steps=1, multi_step_stream_outputs=False, enable_prefix_caching=False, use_async_output_proc=True, use_cached_outputs=True, mm_processor_kwargs=None)
WARNING 11-27 02:55:29 multiproc_gpu_executor.py:53] Reducing Torch parallelism from 32 threads to 1 to avoid unnecessary CPU contention. Set OMP_NUM_THREADS in the external environment to tune this value as needed.
INFO 11-27 02:55:29 custom_cache_manager.py:17] Setting Triton cache manager to: vllm.triton_utils.custom_cache_manager:CustomCacheManager
INFO 11-27 02:55:30 selector.py:121] Using ROCmFlashAttention backend.
/opt/app-root/lib64/python3.11/site-packages/vllm/connections.py:8: RuntimeWarning: Failed to read commit hash:
No module named 'vllm._version'
  from vllm.version import __version__ as VLLM_VERSION
/opt/app-root/lib64/python3.11/site-packages/vllm/connections.py:8: RuntimeWarning: Failed to read commit hash:
No module named 'vllm._version'
  from vllm.version import __version__ as VLLM_VERSION
/opt/app-root/lib64/python3.11/site-packages/vllm/connections.py:8: RuntimeWarning: Failed to read commit hash:
No module named 'vllm._version'
  from vllm.version import __version__ as VLLM_VERSION
(VllmWorkerProcess pid=1626) INFO 11-27 02:55:33 selector.py:121] Using ROCmFlashAttention backend.
(VllmWorkerProcess pid=1626) INFO 11-27 02:55:33 multiproc_worker_utils.py:218] Worker ready; awaiting tasks
(VllmWorkerProcess pid=1624) INFO 11-27 02:55:33 selector.py:121] Using ROCmFlashAttention backend.
(VllmWorkerProcess pid=1624) INFO 11-27 02:55:33 multiproc_worker_utils.py:218] Worker ready; awaiting tasks
(VllmWorkerProcess pid=1625) INFO 11-27 02:55:33 selector.py:121] Using ROCmFlashAttention backend.
(VllmWorkerProcess pid=1625) INFO 11-27 02:55:33 multiproc_worker_utils.py:218] Worker ready; awaiting tasks
(VllmWorkerProcess pid=1626) INFO 11-27 02:55:33 utils.py:1103] Found nccl from library librccl.so.1
INFO 11-27 02:55:33 utils.py:1103] Found nccl from library librccl.so.1
(VllmWorkerProcess pid=1626) INFO 11-27 02:55:33 pynccl.py:63] vLLM is using nccl==2.20.5
INFO 11-27 02:55:33 pynccl.py:63] vLLM is using nccl==2.20.5
(VllmWorkerProcess pid=1625) INFO 11-27 02:55:33 utils.py:1103] Found nccl from library librccl.so.1
(VllmWorkerProcess pid=1625) INFO 11-27 02:55:33 pynccl.py:63] vLLM is using nccl==2.20.5
(VllmWorkerProcess pid=1624) INFO 11-27 02:55:33 utils.py:1103] Found nccl from library librccl.so.1
(VllmWorkerProcess pid=1624) INFO 11-27 02:55:33 pynccl.py:63] vLLM is using nccl==2.20.5
(VllmWorkerProcess pid=1626) WARNING 11-27 02:55:34 custom_all_reduce.py:126] Custom allreduce is disabled because it's not supported on more than two PCIe-only GPUs. To silence this warning, specify disable_custom_all_reduce=True explicitly.
(VllmWorkerProcess pid=1624) WARNING 11-27 02:55:34 custom_all_reduce.py:126] Custom allreduce is disabled because it's not supported on more than two PCIe-only GPUs. To silence this warning, specify disable_custom_all_reduce=True explicitly.
(VllmWorkerProcess pid=1625) WARNING 11-27 02:55:34 custom_all_reduce.py:126] Custom allreduce is disabled because it's not supported on more than two PCIe-only GPUs. To silence this warning, specify disable_custom_all_reduce=True explicitly.
WARNING 11-27 02:55:34 custom_all_reduce.py:126] Custom allreduce is disabled because it's not supported on more than two PCIe-only GPUs. To silence this warning, specify disable_custom_all_reduce=True explicitly.
INFO 11-27 02:55:34 shm_broadcast.py:241] vLLM message queue communication handle: Handle(connect_ip='127.0.0.1', local_reader_ranks=[1, 2, 3], buffer=<vllm.distributed.device_communicators.shm_broadcast.ShmRingBuffer object at 0x7f1877467990>, local_subscribe_port=49627, remote_subscribe_port=None)
INFO 11-27 02:55:34 model_runner.py:1014] Starting to load model /opt/app-root/src/.cache/instructlab/models/granite-8b-lab-v1...
(VllmWorkerProcess pid=1625) INFO 11-27 02:55:34 model_runner.py:1014] Starting to load model /opt/app-root/src/.cache/instructlab/models/granite-8b-lab-v1...
(VllmWorkerProcess pid=1624) INFO 11-27 02:55:34 model_runner.py:1014] Starting to load model /opt/app-root/src/.cache/instructlab/models/granite-8b-lab-v1...
(VllmWorkerProcess pid=1626) INFO 11-27 02:55:34 model_runner.py:1014] Starting to load model /opt/app-root/src/.cache/instructlab/models/granite-8b-lab-v1...
INFO 11-27 02:55:34 selector.py:121] Using ROCmFlashAttention backend.
(VllmWorkerProcess pid=1624) INFO 11-27 02:55:34 selector.py:121] Using ROCmFlashAttention backend.
(VllmWorkerProcess pid=1625) INFO 11-27 02:55:34 selector.py:121] Using ROCmFlashAttention backend.
(VllmWorkerProcess pid=1626) INFO 11-27 02:55:34 selector.py:121] Using ROCmFlashAttention backend.
Loading safetensors checkpoint shards:   0% Completed | 0/4 [00:00<?, ?it/s]
Loading safetensors checkpoint shards:  25% Completed | 1/4 [00:01<00:05,  1.90s/it]
Loading safetensors checkpoint shards:  50% Completed | 2/4 [00:03<00:03,  1.95s/it]
Loading safetensors checkpoint shards:  75% Completed | 3/4 [00:05<00:01,  1.99s/it]
Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:06<00:00,  1.44s/it]
Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:06<00:00,  1.63s/it]
 
INFO 11-27 02:55:41 model_runner.py:1025] Loading model weights took 3.8455 GB
(VllmWorkerProcess pid=1626) INFO 11-27 02:55:41 model_runner.py:1025] Loading model weights took 3.8455 GB
(VllmWorkerProcess pid=1624) INFO 11-27 02:55:41 model_runner.py:1025] Loading model weights took 3.8455 GB
(VllmWorkerProcess pid=1625) INFO 11-27 02:55:41 model_runner.py:1025] Loading model weights took 3.8455 GB
INFO 11-27 02:55:42 distributed_gpu_executor.py:57] # GPU blocks: 85466, # CPU blocks: 6553
INFO 11-27 02:55:44 model_runner.py:1329] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
INFO 11-27 02:55:44 model_runner.py:1333] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing `gpu_memory_utilization` or enforcing eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
(VllmWorkerProcess pid=1625) INFO 11-27 02:55:44 model_runner.py:1329] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
(VllmWorkerProcess pid=1625) INFO 11-27 02:55:44 model_runner.py:1333] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing `gpu_memory_utilization` or enforcing eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
(VllmWorkerProcess pid=1624) INFO 11-27 02:55:44 model_runner.py:1329] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
(VllmWorkerProcess pid=1624) INFO 11-27 02:55:44 model_runner.py:1333] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing `gpu_memory_utilization` or enforcing eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
(VllmWorkerProcess pid=1626) INFO 11-27 02:55:45 model_runner.py:1329] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
(VllmWorkerProcess pid=1626) INFO 11-27 02:55:45 model_runner.py:1333] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing `gpu_memory_utilization` or enforcing eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
(VllmWorkerProcess pid=1626) ERROR 11-27 02:55:58 multiproc_worker_utils.py:233] Exception in worker VllmWorkerProcess while processing method initialize_cache: CUDA kernel failed : 98, Traceback (most recent call last):
(VllmWorkerProcess pid=1626) ERROR 11-27 02:55:58 multiproc_worker_utils.py:233]   File "/opt/app-root/lib64/python3.11/site-packages/vllm/executor/multiproc_worker_utils.py", line 226, in _run_worker_process
(VllmWorkerProcess pid=1626) ERROR 11-27 02:55:58 multiproc_worker_utils.py:233]     output = executor(*args, **kwargs)
(VllmWorkerProcess pid=1626) ERROR 11-27 02:55:58 multiproc_worker_utils.py:233]              ^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=1626) ERROR 11-27 02:55:58 multiproc_worker_utils.py:233]   File "/opt/app-root/lib64/python3.11/site-packages/vllm/worker/worker.py", line 294, in initialize_cache
(VllmWorkerProcess pid=1626) ERROR 11-27 02:55:58 multiproc_worker_utils.py:233]     self._warm_up_model()
(VllmWorkerProcess pid=1626) ERROR 11-27 02:55:58 multiproc_worker_utils.py:233]   File "/opt/app-root/lib64/python3.11/site-packages/vllm/worker/worker.py", line 310, in _warm_up_model
(VllmWorkerProcess pid=1626) ERROR 11-27 02:55:58 multiproc_worker_utils.py:233]     self.model_runner.capture_model(self.gpu_cache)
(VllmWorkerProcess pid=1626) ERROR 11-27 02:55:58 multiproc_worker_utils.py:233]   File "/opt/app-root/lib64/python3.11/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
(VllmWorkerProcess pid=1626) ERROR 11-27 02:55:58 multiproc_worker_utils.py:233]     return func(*args, **kwargs)
(VllmWorkerProcess pid=1626) ERROR 11-27 02:55:58 multiproc_worker_utils.py:233]            ^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=1626) ERROR 11-27 02:55:58 multiproc_worker_utils.py:233]   File "/opt/app-root/lib64/python3.11/site-packages/vllm/worker/model_runner.py", line 1448, in capture_model
(VllmWorkerProcess pid=1626) ERROR 11-27 02:55:58 multiproc_worker_utils.py:233]     graph_runner.capture(**capture_inputs)
(VllmWorkerProcess pid=1626) ERROR 11-27 02:55:58 multiproc_worker_utils.py:233]   File "/opt/app-root/lib64/python3.11/site-packages/vllm/worker/model_runner.py", line 1711, in capture
(VllmWorkerProcess pid=1626) ERROR 11-27 02:55:58 multiproc_worker_utils.py:233]     self.model(
(VllmWorkerProcess pid=1626) ERROR 11-27 02:55:58 multiproc_worker_utils.py:233]   File "/opt/app-root/lib64/python3.11/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
(VllmWorkerProcess pid=1626) ERROR 11-27 02:55:58 multiproc_worker_utils.py:233]     return self._call_impl(*args, **kwargs)
(VllmWorkerProcess pid=1626) ERROR 11-27 02:55:58 multiproc_worker_utils.py:233]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=1626) ERROR 11-27 02:55:58 multiproc_worker_utils.py:233]   File "/opt/app-root/lib64/python3.11/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
(VllmWorkerProcess pid=1626) ERROR 11-27 02:55:58 multiproc_worker_utils.py:233]     return forward_call(*args, **kwargs)
(VllmWorkerProcess pid=1626) ERROR 11-27 02:55:58 multiproc_worker_utils.py:233]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=1626) ERROR 11-27 02:55:58 multiproc_worker_utils.py:233]   File "/opt/app-root/lib64/python3.11/site-packages/vllm/model_executor/models/granite.py", line 422, in forward
(VllmWorkerProcess pid=1626) ERROR 11-27 02:55:58 multiproc_worker_utils.py:233]     model_output = self.model(input_ids, positions, kv_caches,
(VllmWorkerProcess pid=1626) ERROR 11-27 02:55:58 multiproc_worker_utils.py:233]                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=1626) ERROR 11-27 02:55:58 multiproc_worker_utils.py:233]   File "/opt/app-root/lib64/python3.11/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
(VllmWorkerProcess pid=1626) ERROR 11-27 02:55:58 multiproc_worker_utils.py:233]     return self._call_impl(*args, **kwargs)
(VllmWorkerProcess pid=1626) ERROR 11-27 02:55:58 multiproc_worker_utils.py:233]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=1626) ERROR 11-27 02:55:58 multiproc_worker_utils.py:233]   File "/opt/app-root/lib64/python3.11/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
(VllmWorkerProcess pid=1626) ERROR 11-27 02:55:58 multiproc_worker_utils.py:233]     return forward_call(*args, **kwargs)
(VllmWorkerProcess pid=1626) ERROR 11-27 02:55:58 multiproc_worker_utils.py:233]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=1626) ERROR 11-27 02:55:58 multiproc_worker_utils.py:233]   File "/opt/app-root/lib64/python3.11/site-packages/vllm/model_executor/models/granite.py", line 323, in forward
(VllmWorkerProcess pid=1626) ERROR 11-27 02:55:58 multiproc_worker_utils.py:233]     hidden_states = layer(
(VllmWorkerProcess pid=1626) ERROR 11-27 02:55:58 multiproc_worker_utils.py:233]                     ^^^^^^
(VllmWorkerProcess pid=1626) ERROR 11-27 02:55:58 multiproc_worker_utils.py:233]   File "/opt/app-root/lib64/python3.11/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
(VllmWorkerProcess pid=1626) ERROR 11-27 02:55:58 multiproc_worker_utils.py:233]     return self._call_impl(*args, **kwargs)
(VllmWorkerProcess pid=1626) ERROR 11-27 02:55:58 multiproc_worker_utils.py:233]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=1626) ERROR 11-27 02:55:58 multiproc_worker_utils.py:233]   File "/opt/app-root/lib64/python3.11/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
(VllmWorkerProcess pid=1626) ERROR 11-27 02:55:58 multiproc_worker_utils.py:233]     return forward_call(*args, **kwargs)
(VllmWorkerProcess pid=1626) ERROR 11-27 02:55:58 multiproc_worker_utils.py:233]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=1626) ERROR 11-27 02:55:58 multiproc_worker_utils.py:233]   File "/opt/app-root/lib64/python3.11/site-packages/vllm/model_executor/models/granite.py", line 242, in forward
(VllmWorkerProcess pid=1626) ERROR 11-27 02:55:58 multiproc_worker_utils.py:233]     hidden_states = self.self_attn(
(VllmWorkerProcess pid=1626) ERROR 11-27 02:55:58 multiproc_worker_utils.py:233]                     ^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=1626) ERROR 11-27 02:55:58 multiproc_worker_utils.py:233]   File "/opt/app-root/lib64/python3.11/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
(VllmWorkerProcess pid=1626) ERROR 11-27 02:55:58 multiproc_worker_utils.py:233]     return self._call_impl(*args, **kwargs)
(VllmWorkerProcess pid=1626) ERROR 11-27 02:55:58 multiproc_worker_utils.py:233]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=1626) ERROR 11-27 02:55:58 multiproc_worker_utils.py:233]   File "/opt/app-root/lib64/python3.11/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
(VllmWorkerProcess pid=1626) ERROR 11-27 02:55:58 multiproc_worker_utils.py:233]     return forward_call(*args, **kwargs)
(VllmWorkerProcess pid=1626) ERROR 11-27 02:55:58 multiproc_worker_utils.py:233]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=1626) ERROR 11-27 02:55:58 multiproc_worker_utils.py:233]   File "/opt/app-root/lib64/python3.11/site-packages/vllm/model_executor/models/granite.py", line 172, in forward
(VllmWorkerProcess pid=1626) ERROR 11-27 02:55:58 multiproc_worker_utils.py:233]     qkv, _ = self.qkv_proj(hidden_states)
(VllmWorkerProcess pid=1626) ERROR 11-27 02:55:58 multiproc_worker_utils.py:233]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=1626) ERROR 11-27 02:55:58 multiproc_worker_utils.py:233]   File "/opt/app-root/lib64/python3.11/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
(VllmWorkerProcess pid=1626) ERROR 11-27 02:55:58 multiproc_worker_utils.py:233]     return self._call_impl(*args, **kwargs)
(VllmWorkerProcess pid=1626) ERROR 11-27 02:55:58 multiproc_worker_utils.py:233]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=1626) ERROR 11-27 02:55:58 multiproc_worker_utils.py:233]   File "/opt/app-root/lib64/python3.11/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
(VllmWorkerProcess pid=1626) ERROR 11-27 02:55:58 multiproc_worker_utils.py:233]     return forward_call(*args, **kwargs)
(VllmWorkerProcess pid=1626) ERROR 11-27 02:55:58 multiproc_worker_utils.py:233]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=1626) ERROR 11-27 02:55:58 multiproc_worker_utils.py:233]   File "/opt/app-root/lib64/python3.11/site-packages/vllm/model_executor/layers/linear.py", line 366, in forward
(VllmWorkerProcess pid=1626) ERROR 11-27 02:55:58 multiproc_worker_utils.py:233]     output_parallel = self.quant_method.apply(self, input_, bias)
(VllmWorkerProcess pid=1626) ERROR 11-27 02:55:58 multiproc_worker_utils.py:233]                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=1626) ERROR 11-27 02:55:58 multiproc_worker_utils.py:233]   File "/opt/app-root/lib64/python3.11/site-packages/vllm/model_executor/layers/linear.py", line 134, in apply
(VllmWorkerProcess pid=1626) ERROR 11-27 02:55:58 multiproc_worker_utils.py:233]     return tgemm.mm(x, layer.weight, bias)
(VllmWorkerProcess pid=1626) ERROR 11-27 02:55:58 multiproc_worker_utils.py:233]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=1626) ERROR 11-27 02:55:58 multiproc_worker_utils.py:233]   File "/opt/app-root/lib64/python3.11/site-packages/vllm/model_executor/layers/tuned_gemm.py", line 105, in mm
(VllmWorkerProcess pid=1626) ERROR 11-27 02:55:58 multiproc_worker_utils.py:233]     out = self.apply_skinny(m, n, k, inp_view, weights)
(VllmWorkerProcess pid=1626) ERROR 11-27 02:55:58 multiproc_worker_utils.py:233]           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=1626) ERROR 11-27 02:55:58 multiproc_worker_utils.py:233]   File "/opt/app-root/lib64/python3.11/site-packages/vllm/model_executor/layers/tuned_gemm.py", line 70, in apply_skinny
(VllmWorkerProcess pid=1626) ERROR 11-27 02:55:58 multiproc_worker_utils.py:233]     ops.wvSpltK(weights, inp_view, out, n, self.cu_count)
(VllmWorkerProcess pid=1626) ERROR 11-27 02:55:58 multiproc_worker_utils.py:233]   File "/opt/app-root/lib64/python3.11/site-packages/vllm/_custom_ops.py", line 38, in wrapper
(VllmWorkerProcess pid=1626) ERROR 11-27 02:55:58 multiproc_worker_utils.py:233]     return fn(*args, **kwargs)
(VllmWorkerProcess pid=1626) ERROR 11-27 02:55:58 multiproc_worker_utils.py:233]            ^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=1626) ERROR 11-27 02:55:58 multiproc_worker_utils.py:233]   File "/opt/app-root/lib64/python3.11/site-packages/vllm/_custom_ops.py", line 973, in wvSpltK
(VllmWorkerProcess pid=1626) ERROR 11-27 02:55:58 multiproc_worker_utils.py:233]     torch.ops._rocm_C.wvSpltK(a, b, out, N, cu_count)
(VllmWorkerProcess pid=1626) ERROR 11-27 02:55:58 multiproc_worker_utils.py:233]   File "/opt/app-root/lib64/python3.11/site-packages/torch/_ops.py", line 1061, in __call__
(VllmWorkerProcess pid=1626) ERROR 11-27 02:55:58 multiproc_worker_utils.py:233]     return self_._op(*args, **(kwargs or {}))
(VllmWorkerProcess pid=1626) ERROR 11-27 02:55:58 multiproc_worker_utils.py:233]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=1626) ERROR 11-27 02:55:58 multiproc_worker_utils.py:233] RuntimeError: CUDA kernel failed : 98
(VllmWorkerProcess pid=1626) ERROR 11-27 02:55:58 multiproc_worker_utils.py:233] 
(VllmWorkerProcess pid=1625) ERROR 11-27 02:55:58 multiproc_worker_utils.py:233] Exception in worker VllmWorkerProcess while processing method initialize_cache: CUDA kernel failed : 98, Traceback (most recent call last):
(VllmWorkerProcess pid=1625) ERROR 11-27 02:55:58 multiproc_worker_utils.py:233]   File "/opt/app-root/lib64/python3.11/site-packages/vllm/executor/multiproc_worker_utils.py", line 226, in _run_worker_process
(VllmWorkerProcess pid=1625) ERROR 11-27 02:55:58 multiproc_worker_utils.py:233]     output = executor(*args, **kwargs)
(VllmWorkerProcess pid=1625) ERROR 11-27 02:55:58 multiproc_worker_utils.py:233]              ^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=1625) ERROR 11-27 02:55:58 multiproc_worker_utils.py:233]   File "/opt/app-root/lib64/python3.11/site-packages/vllm/worker/worker.py", line 294, in initialize_cache
(VllmWorkerProcess pid=1625) ERROR 11-27 02:55:58 multiproc_worker_utils.py:233]     self._warm_up_model()
(VllmWorkerProcess pid=1625) ERROR 11-27 02:55:58 multiproc_worker_utils.py:233]   File "/opt/app-root/lib64/python3.11/site-packages/vllm/worker/worker.py", line 310, in _warm_up_model
(VllmWorkerProcess pid=1625) ERROR 11-27 02:55:58 multiproc_worker_utils.py:233]     self.model_runner.capture_model(self.gpu_cache)
(VllmWorkerProcess pid=1625) ERROR 11-27 02:55:58 multiproc_worker_utils.py:233]   File "/opt/app-root/lib64/python3.11/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
(VllmWorkerProcess pid=1625) ERROR 11-27 02:55:58 multiproc_worker_utils.py:233]     return func(*args, **kwargs)
(VllmWorkerProcess pid=1625) ERROR 11-27 02:55:58 multiproc_worker_utils.py:233]            ^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=1625) ERROR 11-27 02:55:58 multiproc_worker_utils.py:233]   File "/opt/app-root/lib64/python3.11/site-packages/vllm/worker/model_runner.py", line 1448, in capture_model
(VllmWorkerProcess pid=1625) ERROR 11-27 02:55:58 multiproc_worker_utils.py:233]     graph_runner.capture(**capture_inputs)
(VllmWorkerProcess pid=1625) ERROR 11-27 02:55:58 multiproc_worker_utils.py:233]   File "/opt/app-root/lib64/python3.11/site-packages/vllm/worker/model_runner.py", line 1711, in capture
(VllmWorkerProcess pid=1625) ERROR 11-27 02:55:58 multiproc_worker_utils.py:233]     self.model(
(VllmWorkerProcess pid=1625) ERROR 11-27 02:55:58 multiproc_worker_utils.py:233]   File "/opt/app-root/lib64/python3.11/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
(VllmWorkerProcess pid=1625) ERROR 11-27 02:55:58 multiproc_worker_utils.py:233]     return self._call_impl(*args, **kwargs)
(VllmWorkerProcess pid=1625) ERROR 11-27 02:55:58 multiproc_worker_utils.py:233]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=1625) ERROR 11-27 02:55:58 multiproc_worker_utils.py:233]   File "/opt/app-root/lib64/python3.11/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
(VllmWorkerProcess pid=1625) ERROR 11-27 02:55:58 multiproc_worker_utils.py:233]     return forward_call(*args, **kwargs)
(VllmWorkerProcess pid=1625) ERROR 11-27 02:55:58 multiproc_worker_utils.py:233]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=1625) ERROR 11-27 02:55:58 multiproc_worker_utils.py:233]   File "/opt/app-root/lib64/python3.11/site-packages/vllm/model_executor/models/granite.py", line 422, in forward
(VllmWorkerProcess pid=1625) ERROR 11-27 02:55:58 multiproc_worker_utils.py:233]     model_output = self.model(input_ids, positions, kv_caches,
(VllmWorkerProcess pid=1625) ERROR 11-27 02:55:58 multiproc_worker_utils.py:233]                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=1625) ERROR 11-27 02:55:58 multiproc_worker_utils.py:233]   File "/opt/app-root/lib64/python3.11/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
(VllmWorkerProcess pid=1625) ERROR 11-27 02:55:58 multiproc_worker_utils.py:233]     return self._call_impl(*args, **kwargs)
(VllmWorkerProcess pid=1625) ERROR 11-27 02:55:58 multiproc_worker_utils.py:233]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=1625) ERROR 11-27 02:55:58 multiproc_worker_utils.py:233]   File "/opt/app-root/lib64/python3.11/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
(VllmWorkerProcess pid=1625) ERROR 11-27 02:55:58 multiproc_worker_utils.py:233]     return forward_call(*args, **kwargs)
(VllmWorkerProcess pid=1625) ERROR 11-27 02:55:58 multiproc_worker_utils.py:233]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=1625) ERROR 11-27 02:55:58 multiproc_worker_utils.py:233]   File "/opt/app-root/lib64/python3.11/site-packages/vllm/model_executor/models/granite.py", line 323, in forward
(VllmWorkerProcess pid=1625) ERROR 11-27 02:55:58 multiproc_worker_utils.py:233]     hidden_states = layer(
(VllmWorkerProcess pid=1625) ERROR 11-27 02:55:58 multiproc_worker_utils.py:233]                     ^^^^^^
(VllmWorkerProcess pid=1625) ERROR 11-27 02:55:58 multiproc_worker_utils.py:233]   File "/opt/app-root/lib64/python3.11/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
(VllmWorkerProcess pid=1625) ERROR 11-27 02:55:58 multiproc_worker_utils.py:233]     return self._call_impl(*args, **kwargs)
(VllmWorkerProcess pid=1625) ERROR 11-27 02:55:58 multiproc_worker_utils.py:233]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=1625) ERROR 11-27 02:55:58 multiproc_worker_utils.py:233]   File "/opt/app-root/lib64/python3.11/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
(VllmWorkerProcess pid=1625) ERROR 11-27 02:55:58 multiproc_worker_utils.py:233]     return forward_call(*args, **kwargs)
(VllmWorkerProcess pid=1625) ERROR 11-27 02:55:58 multiproc_worker_utils.py:233]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=1625) ERROR 11-27 02:55:58 multiproc_worker_utils.py:233]   File "/opt/app-root/lib64/python3.11/site-packages/vllm/model_executor/models/granite.py", line 242, in forward
(VllmWorkerProcess pid=1625) ERROR 11-27 02:55:58 multiproc_worker_utils.py:233]     hidden_states = self.self_attn(
(VllmWorkerProcess pid=1625) ERROR 11-27 02:55:58 multiproc_worker_utils.py:233]                     ^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=1625) ERROR 11-27 02:55:58 multiproc_worker_utils.py:233]   File "/opt/app-root/lib64/python3.11/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
(VllmWorkerProcess pid=1625) ERROR 11-27 02:55:58 multiproc_worker_utils.py:233]     return self._call_impl(*args, **kwargs)
(VllmWorkerProcess pid=1625) ERROR 11-27 02:55:58 multiproc_worker_utils.py:233]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=1625) ERROR 11-27 02:55:58 multiproc_worker_utils.py:233]   File "/opt/app-root/lib64/python3.11/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
(VllmWorkerProcess pid=1625) ERROR 11-27 02:55:58 multiproc_worker_utils.py:233]     return forward_call(*args, **kwargs)
(VllmWorkerProcess pid=1625) ERROR 11-27 02:55:58 multiproc_worker_utils.py:233]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=1625) ERROR 11-27 02:55:58 multiproc_worker_utils.py:233]   File "/opt/app-root/lib64/python3.11/site-packages/vllm/model_executor/models/granite.py", line 172, in forward
(VllmWorkerProcess pid=1625) ERROR 11-27 02:55:58 multiproc_worker_utils.py:233]     qkv, _ = self.qkv_proj(hidden_states)
(VllmWorkerProcess pid=1625) ERROR 11-27 02:55:58 multiproc_worker_utils.py:233]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=1625) ERROR 11-27 02:55:58 multiproc_worker_utils.py:233]   File "/opt/app-root/lib64/python3.11/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
(VllmWorkerProcess pid=1625) ERROR 11-27 02:55:58 multiproc_worker_utils.py:233]     return self._call_impl(*args, **kwargs)
(VllmWorkerProcess pid=1625) ERROR 11-27 02:55:58 multiproc_worker_utils.py:233]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=1625) ERROR 11-27 02:55:58 multiproc_worker_utils.py:233]   File "/opt/app-root/lib64/python3.11/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
(VllmWorkerProcess pid=1625) ERROR 11-27 02:55:58 multiproc_worker_utils.py:233]     return forward_call(*args, **kwargs)
(VllmWorkerProcess pid=1625) ERROR 11-27 02:55:58 multiproc_worker_utils.py:233]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=1625) ERROR 11-27 02:55:58 multiproc_worker_utils.py:233]   File "/opt/app-root/lib64/python3.11/site-packages/vllm/model_executor/layers/linear.py", line 366, in forward
(VllmWorkerProcess pid=1625) ERROR 11-27 02:55:58 multiproc_worker_utils.py:233]     output_parallel = self.quant_method.apply(self, input_, bias)
(VllmWorkerProcess pid=1625) ERROR 11-27 02:55:58 multiproc_worker_utils.py:233]                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=1625) ERROR 11-27 02:55:58 multiproc_worker_utils.py:233]   File "/opt/app-root/lib64/python3.11/site-packages/vllm/model_executor/layers/linear.py", line 134, in apply
(VllmWorkerProcess pid=1625) ERROR 11-27 02:55:58 multiproc_worker_utils.py:233]     return tgemm.mm(x, layer.weight, bias)
(VllmWorkerProcess pid=1625) ERROR 11-27 02:55:58 multiproc_worker_utils.py:233]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=1625) ERROR 11-27 02:55:58 multiproc_worker_utils.py:233]   File "/opt/app-root/lib64/python3.11/site-packages/vllm/model_executor/layers/tuned_gemm.py", line 105, in mm
(VllmWorkerProcess pid=1625) ERROR 11-27 02:55:58 multiproc_worker_utils.py:233]     out = self.apply_skinny(m, n, k, inp_view, weights)
(VllmWorkerProcess pid=1625) ERROR 11-27 02:55:58 multiproc_worker_utils.py:233]           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=1625) ERROR 11-27 02:55:58 multiproc_worker_utils.py:233]   File "/opt/app-root/lib64/python3.11/site-packages/vllm/model_executor/layers/tuned_gemm.py", line 70, in apply_skinny
(VllmWorkerProcess pid=1625) ERROR 11-27 02:55:58 multiproc_worker_utils.py:233]     ops.wvSpltK(weights, inp_view, out, n, self.cu_count)
(VllmWorkerProcess pid=1625) ERROR 11-27 02:55:58 multiproc_worker_utils.py:233]   File "/opt/app-root/lib64/python3.11/site-packages/vllm/_custom_ops.py", line 38, in wrapper
(VllmWorkerProcess pid=1625) ERROR 11-27 02:55:58 multiproc_worker_utils.py:233]     return fn(*args, **kwargs)
(VllmWorkerProcess pid=1625) ERROR 11-27 02:55:58 multiproc_worker_utils.py:233]            ^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=1625) ERROR 11-27 02:55:58 multiproc_worker_utils.py:233]   File "/opt/app-root/lib64/python3.11/site-packages/vllm/_custom_ops.py", line 973, in wvSpltK
(VllmWorkerProcess pid=1625) ERROR 11-27 02:55:58 multiproc_worker_utils.py:233]     torch.ops._rocm_C.wvSpltK(a, b, out, N, cu_count)
(VllmWorkerProcess pid=1625) ERROR 11-27 02:55:58 multiproc_worker_utils.py:233]   File "/opt/app-root/lib64/python3.11/site-packages/torch/_ops.py", line 1061, in __call__
(VllmWorkerProcess pid=1625) ERROR 11-27 02:55:58 multiproc_worker_utils.py:233]     return self_._op(*args, **(kwargs or {}))
(VllmWorkerProcess pid=1625) ERROR 11-27 02:55:58 multiproc_worker_utils.py:233]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=1625) ERROR 11-27 02:55:58 multiproc_worker_utils.py:233] RuntimeError: CUDA kernel failed : 98
(VllmWorkerProcess pid=1625) ERROR 11-27 02:55:58 multiproc_worker_utils.py:233] 
ERROR 11-27 02:55:58 multiproc_worker_utils.py:120] Worker VllmWorkerProcess pid 1624 died, exit code: -15
INFO 11-27 02:55:58 multiproc_worker_utils.py:124] Killing local vLLM worker processes
Process SpawnProcess-1:
Traceback (most recent call last):
  File "/usr/lib64/python3.11/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/usr/lib64/python3.11/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/opt/app-root/lib64/python3.11/site-packages/vllm/engine/multiprocessing/engine.py", line 388, in run_mp_engine
    engine = MQLLMEngine.from_engine_args(engine_args=engine_args,
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/app-root/lib64/python3.11/site-packages/vllm/engine/multiprocessing/engine.py", line 138, in from_engine_args
    return cls(
           ^^^^
  File "/opt/app-root/lib64/python3.11/site-packages/vllm/engine/multiprocessing/engine.py", line 78, in __init__
    self.engine = LLMEngine(*args,
                  ^^^^^^^^^^^^^^^^
  File "/opt/app-root/lib64/python3.11/site-packages/vllm/engine/llm_engine.py", line 339, in __init__
    self._initialize_kv_caches()
  File "/opt/app-root/lib64/python3.11/site-packages/vllm/engine/llm_engine.py", line 487, in _initialize_kv_caches
    self.model_executor.initialize_cache(num_gpu_blocks, num_cpu_blocks)
  File "/opt/app-root/lib64/python3.11/site-packages/vllm/executor/distributed_gpu_executor.py", line 63, in initialize_cache
    self._run_workers("initialize_cache",
  File "/opt/app-root/lib64/python3.11/site-packages/vllm/executor/multiproc_gpu_executor.py", line 185, in _run_workers
    driver_worker_output = driver_worker_method(*args, **kwargs)
                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/app-root/lib64/python3.11/site-packages/vllm/worker/worker.py", line 294, in initialize_cache
    self._warm_up_model()
  File "/opt/app-root/lib64/python3.11/site-packages/vllm/worker/worker.py", line 310, in _warm_up_model
    self.model_runner.capture_model(self.gpu_cache)
  File "/opt/app-root/lib64/python3.11/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/opt/app-root/lib64/python3.11/site-packages/vllm/worker/model_runner.py", line 1448, in capture_model
    graph_runner.capture(**capture_inputs)
  File "/opt/app-root/lib64/python3.11/site-packages/vllm/worker/model_runner.py", line 1711, in capture
    self.model(
  File "/opt/app-root/lib64/python3.11/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/app-root/lib64/python3.11/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/app-root/lib64/python3.11/site-packages/vllm/model_executor/models/granite.py", line 422, in forward
    model_output = self.model(input_ids, positions, kv_caches,
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/app-root/lib64/python3.11/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/app-root/lib64/python3.11/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/app-root/lib64/python3.11/site-packages/vllm/model_executor/models/granite.py", line 323, in forward
    hidden_states = layer(
                    ^^^^^^
  File "/opt/app-root/lib64/python3.11/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/app-root/lib64/python3.11/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/app-root/lib64/python3.11/site-packages/vllm/model_executor/models/granite.py", line 242, in forward
    hidden_states = self.self_attn(
                    ^^^^^^^^^^^^^^^
  File "/opt/app-root/lib64/python3.11/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/app-root/lib64/python3.11/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/app-root/lib64/python3.11/site-packages/vllm/model_executor/models/granite.py", line 172, in forward
    qkv, _ = self.qkv_proj(hidden_states)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/app-root/lib64/python3.11/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/app-root/lib64/python3.11/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/app-root/lib64/python3.11/site-packages/vllm/model_executor/layers/linear.py", line 366, in forward
    output_parallel = self.quant_method.apply(self, input_, bias)
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/app-root/lib64/python3.11/site-packages/vllm/model_executor/layers/linear.py", line 134, in apply
    return tgemm.mm(x, layer.weight, bias)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/app-root/lib64/python3.11/site-packages/vllm/model_executor/layers/tuned_gemm.py", line 105, in mm
    out = self.apply_skinny(m, n, k, inp_view, weights)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/app-root/lib64/python3.11/site-packages/vllm/model_executor/layers/tuned_gemm.py", line 70, in apply_skinny
    ops.wvSpltK(weights, inp_view, out, n, self.cu_count)
  File "/opt/app-root/lib64/python3.11/site-packages/vllm/_custom_ops.py", line 38, in wrapper
    return fn(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^
  File "/opt/app-root/lib64/python3.11/site-packages/vllm/_custom_ops.py", line 973, in wvSpltK
    torch.ops._rocm_C.wvSpltK(a, b, out, N, cu_count)
  File "/opt/app-root/lib64/python3.11/site-packages/torch/_ops.py", line 1061, in __call__
    return self_._op(*args, **(kwargs or {}))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: CUDA kernel failed : 98
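
For triage, the failing processes and the error code can be pulled out of a log like the one above with a few lines of Python. This is only a sketch: the helper name `summarize_kernel_failures` and the lookup table `HIP_ERRORS` are hypothetical and are not part of vLLM or InstructLab. The single table entry reflects the reading given at the top of this ticket (98 == invalid device function); the symbolic name `hipErrorInvalidDeviceFunction` is the HIP runtime's name for that code.

```python
import re

# Hypothetical lookup table for triage; only the code seen in this ticket is listed.
# 98 == invalid device function (hipErrorInvalidDeviceFunction in the HIP runtime).
HIP_ERRORS = {
    98: "hipErrorInvalidDeviceFunction (invalid device function)",
}

# Message format as it appears in the log: "CUDA kernel failed : 98"
_FAILED = re.compile(r"CUDA kernel failed : (\d+)")
# Worker prefix as it appears in the log: "(VllmWorkerProcess pid=1625)"
_WORKER = re.compile(r"\(VllmWorkerProcess pid=(\d+)\)")


def summarize_kernel_failures(log_text):
    """Return unique (pid, code, description) tuples for kernel failures.

    pid is None when the line has no VllmWorkerProcess prefix, which is
    the case for the traceback printed by the driver process.
    """
    found = set()
    for line in log_text.splitlines():
        failed = _FAILED.search(line)
        if not failed:
            continue
        code = int(failed.group(1))
        worker = _WORKER.search(line)
        pid = int(worker.group(1)) if worker else None
        found.add((pid, code, HIP_ERRORS.get(code, "unknown")))
    # Worker entries first (ordered by pid), then entries without a pid.
    return sorted(found, key=lambda t: (t[0] is None, t[0] or 0, t[1]))


if __name__ == "__main__":
    sample = (
        "(VllmWorkerProcess pid=1625) ERROR 11-27 02:55:58 "
        "multiproc_worker_utils.py:233] RuntimeError: CUDA kernel failed : 98\n"
        "RuntimeError: CUDA kernel failed : 98\n"
    )
    for pid, code, description in summarize_kernel_failures(sample):
        print(pid, code, description)
```

Both tracebacks above end in the same `torch.ops._rocm_C.wvSpltK` call, so the helper reports the same code for the worker and for the driver process. Confirming the suspected arch problem would still require comparing the GPU architecture on the affected host with the architectures the ROCm extension was built for, which this sketch does not attempt.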