Red Hat Enterprise Linux AI / RHELAI-2414

AMD vLLM cuda kernel 98 failure

      A CUDA kernel failure reported by tflink (log below) indicates a possible GPU architecture (arch) problem:

      Error code 98 == invalid device function
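
One way to test the arch hypothesis is to compare the gfx target each GPU reports with the targets the installed PyTorch build was compiled for. The sketch below is a minimal illustration, not a confirmed diagnosis. `arch_is_supported` and `report` are hypothetical helper names, and reading `gcnArchName` assumes a ROCm build of PyTorch (the code falls back gracefully if the attribute is absent). It only inspects PyTorch's own arch list. vLLM's ROCm extension (the `_rocm_C` module in the traceback) is compiled separately, so its target list may differ and would need to be checked against the image's build configuration.

```python
def arch_is_supported(device_arch: str, built_archs: list[str]) -> bool:
    """Return True if the device's base gfx target appears in the compiled list.

    ROCm may report feature suffixes (e.g. 'gfx90a:sramecc+:xnack-');
    only the base target before the first ':' is compared here.
    """
    base = device_arch.split(":")[0]
    return base in {arch.split(":")[0] for arch in built_archs}


def report() -> None:
    """Print, per visible GPU, whether its arch is in PyTorch's compiled list."""
    try:
        import torch
    except ImportError:
        print("torch is not installed in this environment")
        return
    if not torch.cuda.is_available():
        print("no GPU visible to torch")
        return
    built = torch.cuda.get_arch_list()  # targets PyTorch itself was built for
    print("PyTorch compiled for:", built)
    for index in range(torch.cuda.device_count()):
        props = torch.cuda.get_device_properties(index)
        # Assumption: gcnArchName is only present on ROCm builds of PyTorch.
        arch = getattr(props, "gcnArchName", None)
        if arch is None:
            print(f"GPU {index} ({props.name}): gcnArchName not available")
            continue
        status = "listed" if arch_is_supported(arch, built) else "NOT listed"
        print(f"GPU {index} ({props.name}): {arch} is {status}")


if __name__ == "__main__":
    report()
```

Running this inside the same container as `ilab model serve` would show whether the devices report a target that is missing from the list. A "NOT listed" result would be consistent with an invalid device function error, though a "listed" result does not rule out a mismatch in vLLM's own compiled kernels.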

      (app-root) /$ export VLLM_USE_TRITON_FLASH_ATTN=0
      (app-root) /$ ilab model serve
      INFO 2024-11-27 02:55:22,827 instructlab.model.serve_backend:56: Using model '/opt/app-root/src/.cache/instructlab/models/granite-8b-lab-v1' with -1 gpu-layers and 4096 max context size.
      INFO 2024-11-27 02:55:22,829 instructlab.model.backends.vllm:313: vLLM starting up on pid 1515 at http://127.0.0.1:8000/v1
      WARNING 11-27 02:55:24 rocm.py:17] `fork` method is not supported by ROCm. VLLM_WORKER_MULTIPROC_METHOD is overridden to `spawn` instead.
      /opt/app-root/lib64/python3.11/site-packages/vllm/connections.py:8: RuntimeWarning: Failed to read commit hash:
      No module named 'vllm._version'
        from vllm.version import __version__ as VLLM_VERSION
      INFO 11-27 02:55:26 api_server.py:526] vLLM API server version 0.6.2
      INFO 11-27 02:55:26 api_server.py:527] args: Namespace(host='127.0.0.1', port=8000, uvicorn_log_level='info', allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, lora_modules=None, prompt_adapters=None, chat_template='/tmp/tmp516a2rel', response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, ssl_cert_reqs=0, root_path=None, middleware=[], return_tokens_as_token_ids=False, disable_frontend_multiprocessing=False, enable_auto_tool_choice=False, tool_call_parser=None, model='/opt/app-root/src/.cache/instructlab/models/granite-8b-lab-v1', tokenizer=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=False, download_dir=None, load_format='auto', config_format='auto', dtype='auto', kv_cache_dtype='auto', quantization_param_path=None, max_model_len=None, guided_decoding_backend='outlines', distributed_executor_backend='mp', worker_use_ray=False, pipeline_parallel_size=1, tensor_parallel_size=4, max_parallel_loading_workers=None, ray_workers_use_nsight=False, block_size=16, enable_prefix_caching=False, disable_sliding_window=False, use_v2_block_manager=False, num_lookahead_slots=0, seed=0, swap_space=4, cpu_offload_gb=0, gpu_memory_utilization=0.9, num_gpu_blocks_override=None, max_num_batched_tokens=None, max_num_seqs=256, max_logprobs=20, disable_log_stats=False, quantization=None, rope_scaling=None, rope_theta=None, enforce_eager=False, max_context_len_to_capture=None, max_seq_len_to_capture=8192, disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, limit_mm_per_prompt=None, mm_processor_kwargs=None, enable_lora=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', long_lora_scaling_factors=None, max_cpu_loras=None, fully_sharded_loras=False, enable_prompt_adapter=False, max_prompt_adapters=1, 
max_prompt_adapter_token=0, device='auto', num_scheduler_steps=1, multi_step_stream_outputs=False, scheduler_delay_factor=0.0, enable_chunked_prefill=None, speculative_model=None, speculative_model_quantization=None, num_speculative_tokens=None, speculative_draft_tensor_parallel_size=None, speculative_max_model_len=None, speculative_disable_by_batch_size=None, ngram_prompt_lookup_max=None, ngram_prompt_lookup_min=None, spec_decoding_acceptance_method='rejection_sampler', typical_acceptance_sampler_posterior_threshold=None, typical_acceptance_sampler_posterior_alpha=None, disable_logprobs_during_spec_decoding=None, model_loader_extra_config=None, ignore_patterns=[], preemption_mode=None, served_model_name=None, qlora_adapter_name_or_path=None, otlp_traces_endpoint=None, collect_detailed_traces=None, disable_async_output_proc=False, override_neuron_config=None, disable_log_requests=False, max_log_len=None, disable_fastapi_docs=False)
      INFO 11-27 02:55:26 api_server.py:164] Multiprocessing frontend to use ipc:///tmp/ce023fd5-ec71-4440-9c37-f21252811f3a for IPC Path.
      INFO 11-27 02:55:26 api_server.py:177] Started engine process with PID 1538
      INFO 11-27 02:55:26 config.py:1659] Downcasting torch.float32 to torch.float16.
      /opt/app-root/lib64/python3.11/site-packages/vllm/connections.py:8: RuntimeWarning: Failed to read commit hash:
      No module named 'vllm._version'
        from vllm.version import __version__ as VLLM_VERSION
      INFO 11-27 02:55:29 config.py:1659] Downcasting torch.float32 to torch.float16.
      INFO 11-27 02:55:29 llm_engine.py:226] Initializing an LLM engine (v0.6.2) with config: model='/opt/app-root/src/.cache/instructlab/models/granite-8b-lab-v1', speculative_config=None, tokenizer='/opt/app-root/src/.cache/instructlab/models/granite-8b-lab-v1', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=4096, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=4, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=/opt/app-root/src/.cache/instructlab/models/granite-8b-lab-v1, use_v2_block_manager=False, num_scheduler_steps=1, multi_step_stream_outputs=False, enable_prefix_caching=False, use_async_output_proc=True, use_cached_outputs=True, mm_processor_kwargs=None)
      WARNING 11-27 02:55:29 multiproc_gpu_executor.py:53] Reducing Torch parallelism from 32 threads to 1 to avoid unnecessary CPU contention. Set OMP_NUM_THREADS in the external environment to tune this value as needed.
      INFO 11-27 02:55:29 custom_cache_manager.py:17] Setting Triton cache manager to: vllm.triton_utils.custom_cache_manager:CustomCacheManager
      INFO 11-27 02:55:30 selector.py:121] Using ROCmFlashAttention backend.
      /opt/app-root/lib64/python3.11/site-packages/vllm/connections.py:8: RuntimeWarning: Failed to read commit hash:
      No module named 'vllm._version'
        from vllm.version import __version__ as VLLM_VERSION
      /opt/app-root/lib64/python3.11/site-packages/vllm/connections.py:8: RuntimeWarning: Failed to read commit hash:
      No module named 'vllm._version'
        from vllm.version import __version__ as VLLM_VERSION
      /opt/app-root/lib64/python3.11/site-packages/vllm/connections.py:8: RuntimeWarning: Failed to read commit hash:
      No module named 'vllm._version'
        from vllm.version import __version__ as VLLM_VERSION
      (VllmWorkerProcess pid=1626) INFO 11-27 02:55:33 selector.py:121] Using ROCmFlashAttention backend.
      (VllmWorkerProcess pid=1626) INFO 11-27 02:55:33 multiproc_worker_utils.py:218] Worker ready; awaiting tasks
      (VllmWorkerProcess pid=1624) INFO 11-27 02:55:33 selector.py:121] Using ROCmFlashAttention backend.
      (VllmWorkerProcess pid=1624) INFO 11-27 02:55:33 multiproc_worker_utils.py:218] Worker ready; awaiting tasks
      (VllmWorkerProcess pid=1625) INFO 11-27 02:55:33 selector.py:121] Using ROCmFlashAttention backend.
      (VllmWorkerProcess pid=1625) INFO 11-27 02:55:33 multiproc_worker_utils.py:218] Worker ready; awaiting tasks
      (VllmWorkerProcess pid=1626) INFO 11-27 02:55:33 utils.py:1103] Found nccl from library librccl.so.1
      INFO 11-27 02:55:33 utils.py:1103] Found nccl from library librccl.so.1
      (VllmWorkerProcess pid=1626) INFO 11-27 02:55:33 pynccl.py:63] vLLM is using nccl==2.20.5
      INFO 11-27 02:55:33 pynccl.py:63] vLLM is using nccl==2.20.5
      (VllmWorkerProcess pid=1625) INFO 11-27 02:55:33 utils.py:1103] Found nccl from library librccl.so.1
      (VllmWorkerProcess pid=1625) INFO 11-27 02:55:33 pynccl.py:63] vLLM is using nccl==2.20.5
      (VllmWorkerProcess pid=1624) INFO 11-27 02:55:33 utils.py:1103] Found nccl from library librccl.so.1
      (VllmWorkerProcess pid=1624) INFO 11-27 02:55:33 pynccl.py:63] vLLM is using nccl==2.20.5
      (VllmWorkerProcess pid=1626) WARNING 11-27 02:55:34 custom_all_reduce.py:126] Custom allreduce is disabled because it's not supported on more than two PCIe-only GPUs. To silence this warning, specify disable_custom_all_reduce=True explicitly.
      (VllmWorkerProcess pid=1624) WARNING 11-27 02:55:34 custom_all_reduce.py:126] Custom allreduce is disabled because it's not supported on more than two PCIe-only GPUs. To silence this warning, specify disable_custom_all_reduce=True explicitly.
      (VllmWorkerProcess pid=1625) WARNING 11-27 02:55:34 custom_all_reduce.py:126] Custom allreduce is disabled because it's not supported on more than two PCIe-only GPUs. To silence this warning, specify disable_custom_all_reduce=True explicitly.
      WARNING 11-27 02:55:34 custom_all_reduce.py:126] Custom allreduce is disabled because it's not supported on more than two PCIe-only GPUs. To silence this warning, specify disable_custom_all_reduce=True explicitly.
      INFO 11-27 02:55:34 shm_broadcast.py:241] vLLM message queue communication handle: Handle(connect_ip='127.0.0.1', local_reader_ranks=[1, 2, 3], buffer=<vllm.distributed.device_communicators.shm_broadcast.ShmRingBuffer object at 0x7f1877467990>, local_subscribe_port=49627, remote_subscribe_port=None)
      INFO 11-27 02:55:34 model_runner.py:1014] Starting to load model /opt/app-root/src/.cache/instructlab/models/granite-8b-lab-v1...
      (VllmWorkerProcess pid=1625) INFO 11-27 02:55:34 model_runner.py:1014] Starting to load model /opt/app-root/src/.cache/instructlab/models/granite-8b-lab-v1...
      (VllmWorkerProcess pid=1624) INFO 11-27 02:55:34 model_runner.py:1014] Starting to load model /opt/app-root/src/.cache/instructlab/models/granite-8b-lab-v1...
      (VllmWorkerProcess pid=1626) INFO 11-27 02:55:34 model_runner.py:1014] Starting to load model /opt/app-root/src/.cache/instructlab/models/granite-8b-lab-v1...
      INFO 11-27 02:55:34 selector.py:121] Using ROCmFlashAttention backend.
      (VllmWorkerProcess pid=1624) INFO 11-27 02:55:34 selector.py:121] Using ROCmFlashAttention backend.
      (VllmWorkerProcess pid=1625) INFO 11-27 02:55:34 selector.py:121] Using ROCmFlashAttention backend.
      (VllmWorkerProcess pid=1626) INFO 11-27 02:55:34 selector.py:121] Using ROCmFlashAttention backend.
      Loading safetensors checkpoint shards:   0% Completed | 0/4 [00:00<?, ?it/s]
      Loading safetensors checkpoint shards:  25% Completed | 1/4 [00:01<00:05,  1.90s/it]
      Loading safetensors checkpoint shards:  50% Completed | 2/4 [00:03<00:03,  1.95s/it]
      Loading safetensors checkpoint shards:  75% Completed | 3/4 [00:05<00:01,  1.99s/it]
      Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:06<00:00,  1.44s/it]
      Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:06<00:00,  1.63s/it]
       
      INFO 11-27 02:55:41 model_runner.py:1025] Loading model weights took 3.8455 GB
      (VllmWorkerProcess pid=1626) INFO 11-27 02:55:41 model_runner.py:1025] Loading model weights took 3.8455 GB
      (VllmWorkerProcess pid=1624) INFO 11-27 02:55:41 model_runner.py:1025] Loading model weights took 3.8455 GB
      (VllmWorkerProcess pid=1625) INFO 11-27 02:55:41 model_runner.py:1025] Loading model weights took 3.8455 GB
      INFO 11-27 02:55:42 distributed_gpu_executor.py:57] # GPU blocks: 85466, # CPU blocks: 6553
      INFO 11-27 02:55:44 model_runner.py:1329] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
      INFO 11-27 02:55:44 model_runner.py:1333] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing `gpu_memory_utilization` or enforcing eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
      (VllmWorkerProcess pid=1625) INFO 11-27 02:55:44 model_runner.py:1329] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
      (VllmWorkerProcess pid=1625) INFO 11-27 02:55:44 model_runner.py:1333] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing `gpu_memory_utilization` or enforcing eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
      (VllmWorkerProcess pid=1624) INFO 11-27 02:55:44 model_runner.py:1329] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
      (VllmWorkerProcess pid=1624) INFO 11-27 02:55:44 model_runner.py:1333] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing `gpu_memory_utilization` or enforcing eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
      (VllmWorkerProcess pid=1626) INFO 11-27 02:55:45 model_runner.py:1329] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
      (VllmWorkerProcess pid=1626) INFO 11-27 02:55:45 model_runner.py:1333] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing `gpu_memory_utilization` or enforcing eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
      (VllmWorkerProcess pid=1626) ERROR 11-27 02:55:58 multiproc_worker_utils.py:233] Exception in worker VllmWorkerProcess while processing method initialize_cache: CUDA kernel failed : 98, Traceback (most recent call last):
      (VllmWorkerProcess pid=1626) ERROR 11-27 02:55:58 multiproc_worker_utils.py:233]   File "/opt/app-root/lib64/python3.11/site-packages/vllm/executor/multiproc_worker_utils.py", line 226, in _run_worker_process
      (VllmWorkerProcess pid=1626) ERROR 11-27 02:55:58 multiproc_worker_utils.py:233]     output = executor(*args, **kwargs)
      (VllmWorkerProcess pid=1626) ERROR 11-27 02:55:58 multiproc_worker_utils.py:233]              ^^^^^^^^^^^^^^^^^^^^^^^^^
      (VllmWorkerProcess pid=1626) ERROR 11-27 02:55:58 multiproc_worker_utils.py:233]   File "/opt/app-root/lib64/python3.11/site-packages/vllm/worker/worker.py", line 294, in initialize_cache
      (VllmWorkerProcess pid=1626) ERROR 11-27 02:55:58 multiproc_worker_utils.py:233]     self._warm_up_model()
      (VllmWorkerProcess pid=1626) ERROR 11-27 02:55:58 multiproc_worker_utils.py:233]   File "/opt/app-root/lib64/python3.11/site-packages/vllm/worker/worker.py", line 310, in _warm_up_model
      (VllmWorkerProcess pid=1626) ERROR 11-27 02:55:58 multiproc_worker_utils.py:233]     self.model_runner.capture_model(self.gpu_cache)
      (VllmWorkerProcess pid=1626) ERROR 11-27 02:55:58 multiproc_worker_utils.py:233]   File "/opt/app-root/lib64/python3.11/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
      (VllmWorkerProcess pid=1626) ERROR 11-27 02:55:58 multiproc_worker_utils.py:233]     return func(*args, **kwargs)
      (VllmWorkerProcess pid=1626) ERROR 11-27 02:55:58 multiproc_worker_utils.py:233]            ^^^^^^^^^^^^^^^^^^^^^
      (VllmWorkerProcess pid=1626) ERROR 11-27 02:55:58 multiproc_worker_utils.py:233]   File "/opt/app-root/lib64/python3.11/site-packages/vllm/worker/model_runner.py", line 1448, in capture_model
      (VllmWorkerProcess pid=1626) ERROR 11-27 02:55:58 multiproc_worker_utils.py:233]     graph_runner.capture(**capture_inputs)
      (VllmWorkerProcess pid=1626) ERROR 11-27 02:55:58 multiproc_worker_utils.py:233]   File "/opt/app-root/lib64/python3.11/site-packages/vllm/worker/model_runner.py", line 1711, in capture
      (VllmWorkerProcess pid=1626) ERROR 11-27 02:55:58 multiproc_worker_utils.py:233]     self.model(
      (VllmWorkerProcess pid=1626) ERROR 11-27 02:55:58 multiproc_worker_utils.py:233]   File "/opt/app-root/lib64/python3.11/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
      (VllmWorkerProcess pid=1626) ERROR 11-27 02:55:58 multiproc_worker_utils.py:233]     return self._call_impl(*args, **kwargs)
      (VllmWorkerProcess pid=1626) ERROR 11-27 02:55:58 multiproc_worker_utils.py:233]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
      (VllmWorkerProcess pid=1626) ERROR 11-27 02:55:58 multiproc_worker_utils.py:233]   File "/opt/app-root/lib64/python3.11/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
      (VllmWorkerProcess pid=1626) ERROR 11-27 02:55:58 multiproc_worker_utils.py:233]     return forward_call(*args, **kwargs)
      (VllmWorkerProcess pid=1626) ERROR 11-27 02:55:58 multiproc_worker_utils.py:233]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
      (VllmWorkerProcess pid=1626) ERROR 11-27 02:55:58 multiproc_worker_utils.py:233]   File "/opt/app-root/lib64/python3.11/site-packages/vllm/model_executor/models/granite.py", line 422, in forward
      (VllmWorkerProcess pid=1626) ERROR 11-27 02:55:58 multiproc_worker_utils.py:233]     model_output = self.model(input_ids, positions, kv_caches,
      (VllmWorkerProcess pid=1626) ERROR 11-27 02:55:58 multiproc_worker_utils.py:233]                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
      (VllmWorkerProcess pid=1626) ERROR 11-27 02:55:58 multiproc_worker_utils.py:233]   File "/opt/app-root/lib64/python3.11/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
      (VllmWorkerProcess pid=1626) ERROR 11-27 02:55:58 multiproc_worker_utils.py:233]     return self._call_impl(*args, **kwargs)
      (VllmWorkerProcess pid=1626) ERROR 11-27 02:55:58 multiproc_worker_utils.py:233]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
      (VllmWorkerProcess pid=1626) ERROR 11-27 02:55:58 multiproc_worker_utils.py:233]   File "/opt/app-root/lib64/python3.11/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
      (VllmWorkerProcess pid=1626) ERROR 11-27 02:55:58 multiproc_worker_utils.py:233]     return forward_call(*args, **kwargs)
      (VllmWorkerProcess pid=1626) ERROR 11-27 02:55:58 multiproc_worker_utils.py:233]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
      (VllmWorkerProcess pid=1626) ERROR 11-27 02:55:58 multiproc_worker_utils.py:233]   File "/opt/app-root/lib64/python3.11/site-packages/vllm/model_executor/models/granite.py", line 323, in forward
      (VllmWorkerProcess pid=1626) ERROR 11-27 02:55:58 multiproc_worker_utils.py:233]     hidden_states = layer(
      (VllmWorkerProcess pid=1626) ERROR 11-27 02:55:58 multiproc_worker_utils.py:233]                     ^^^^^^
      (VllmWorkerProcess pid=1626) ERROR 11-27 02:55:58 multiproc_worker_utils.py:233]   File "/opt/app-root/lib64/python3.11/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
      (VllmWorkerProcess pid=1626) ERROR 11-27 02:55:58 multiproc_worker_utils.py:233]     return self._call_impl(*args, **kwargs)
      (VllmWorkerProcess pid=1626) ERROR 11-27 02:55:58 multiproc_worker_utils.py:233]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
      (VllmWorkerProcess pid=1626) ERROR 11-27 02:55:58 multiproc_worker_utils.py:233]   File "/opt/app-root/lib64/python3.11/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
      (VllmWorkerProcess pid=1626) ERROR 11-27 02:55:58 multiproc_worker_utils.py:233]     return forward_call(*args, **kwargs)
      (VllmWorkerProcess pid=1626) ERROR 11-27 02:55:58 multiproc_worker_utils.py:233]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
      (VllmWorkerProcess pid=1626) ERROR 11-27 02:55:58 multiproc_worker_utils.py:233]   File "/opt/app-root/lib64/python3.11/site-packages/vllm/model_executor/models/granite.py", line 242, in forward
      (VllmWorkerProcess pid=1626) ERROR 11-27 02:55:58 multiproc_worker_utils.py:233]     hidden_states = self.self_attn(
      (VllmWorkerProcess pid=1626) ERROR 11-27 02:55:58 multiproc_worker_utils.py:233]                     ^^^^^^^^^^^^^^^
      (VllmWorkerProcess pid=1626) ERROR 11-27 02:55:58 multiproc_worker_utils.py:233]   File "/opt/app-root/lib64/python3.11/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
      (VllmWorkerProcess pid=1626) ERROR 11-27 02:55:58 multiproc_worker_utils.py:233]     return self._call_impl(*args, **kwargs)
      (VllmWorkerProcess pid=1626) ERROR 11-27 02:55:58 multiproc_worker_utils.py:233]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
      (VllmWorkerProcess pid=1626) ERROR 11-27 02:55:58 multiproc_worker_utils.py:233]   File "/opt/app-root/lib64/python3.11/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
      (VllmWorkerProcess pid=1626) ERROR 11-27 02:55:58 multiproc_worker_utils.py:233]     return forward_call(*args, **kwargs)
      (VllmWorkerProcess pid=1626) ERROR 11-27 02:55:58 multiproc_worker_utils.py:233]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
      (VllmWorkerProcess pid=1626) ERROR 11-27 02:55:58 multiproc_worker_utils.py:233]   File "/opt/app-root/lib64/python3.11/site-packages/vllm/model_executor/models/granite.py", line 172, in forward
      (VllmWorkerProcess pid=1626) ERROR 11-27 02:55:58 multiproc_worker_utils.py:233]     qkv, _ = self.qkv_proj(hidden_states)
      (VllmWorkerProcess pid=1626) ERROR 11-27 02:55:58 multiproc_worker_utils.py:233]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
      (VllmWorkerProcess pid=1626) ERROR 11-27 02:55:58 multiproc_worker_utils.py:233]   File "/opt/app-root/lib64/python3.11/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
      (VllmWorkerProcess pid=1626) ERROR 11-27 02:55:58 multiproc_worker_utils.py:233]     return self._call_impl(*args, **kwargs)
      (VllmWorkerProcess pid=1626) ERROR 11-27 02:55:58 multiproc_worker_utils.py:233]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
      (VllmWorkerProcess pid=1626) ERROR 11-27 02:55:58 multiproc_worker_utils.py:233]   File "/opt/app-root/lib64/python3.11/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
      (VllmWorkerProcess pid=1626) ERROR 11-27 02:55:58 multiproc_worker_utils.py:233]     return forward_call(*args, **kwargs)
      (VllmWorkerProcess pid=1626) ERROR 11-27 02:55:58 multiproc_worker_utils.py:233]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
      (VllmWorkerProcess pid=1626) ERROR 11-27 02:55:58 multiproc_worker_utils.py:233]   File "/opt/app-root/lib64/python3.11/site-packages/vllm/model_executor/layers/linear.py", line 366, in forward
      (VllmWorkerProcess pid=1626) ERROR 11-27 02:55:58 multiproc_worker_utils.py:233]     output_parallel = self.quant_method.apply(self, input_, bias)
      (VllmWorkerProcess pid=1626) ERROR 11-27 02:55:58 multiproc_worker_utils.py:233]                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
      (VllmWorkerProcess pid=1626) ERROR 11-27 02:55:58 multiproc_worker_utils.py:233]   File "/opt/app-root/lib64/python3.11/site-packages/vllm/model_executor/layers/linear.py", line 134, in apply
      (VllmWorkerProcess pid=1626) ERROR 11-27 02:55:58 multiproc_worker_utils.py:233]     return tgemm.mm(x, layer.weight, bias)
      (VllmWorkerProcess pid=1626) ERROR 11-27 02:55:58 multiproc_worker_utils.py:233]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
      (VllmWorkerProcess pid=1626) ERROR 11-27 02:55:58 multiproc_worker_utils.py:233]   File "/opt/app-root/lib64/python3.11/site-packages/vllm/model_executor/layers/tuned_gemm.py", line 105, in mm
      (VllmWorkerProcess pid=1626) ERROR 11-27 02:55:58 multiproc_worker_utils.py:233]     out = self.apply_skinny(m, n, k, inp_view, weights)
      (VllmWorkerProcess pid=1626) ERROR 11-27 02:55:58 multiproc_worker_utils.py:233]           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
      (VllmWorkerProcess pid=1626) ERROR 11-27 02:55:58 multiproc_worker_utils.py:233]   File "/opt/app-root/lib64/python3.11/site-packages/vllm/model_executor/layers/tuned_gemm.py", line 70, in apply_skinny
      (VllmWorkerProcess pid=1626) ERROR 11-27 02:55:58 multiproc_worker_utils.py:233]     ops.wvSpltK(weights, inp_view, out, n, self.cu_count)
      (VllmWorkerProcess pid=1626) ERROR 11-27 02:55:58 multiproc_worker_utils.py:233]   File "/opt/app-root/lib64/python3.11/site-packages/vllm/_custom_ops.py", line 38, in wrapper
      (VllmWorkerProcess pid=1626) ERROR 11-27 02:55:58 multiproc_worker_utils.py:233]     return fn(*args, **kwargs)
      (VllmWorkerProcess pid=1626) ERROR 11-27 02:55:58 multiproc_worker_utils.py:233]            ^^^^^^^^^^^^^^^^^^^
      (VllmWorkerProcess pid=1626) ERROR 11-27 02:55:58 multiproc_worker_utils.py:233]   File "/opt/app-root/lib64/python3.11/site-packages/vllm/_custom_ops.py", line 973, in wvSpltK
      (VllmWorkerProcess pid=1626) ERROR 11-27 02:55:58 multiproc_worker_utils.py:233]     torch.ops._rocm_C.wvSpltK(a, b, out, N, cu_count)
      (VllmWorkerProcess pid=1626) ERROR 11-27 02:55:58 multiproc_worker_utils.py:233]   File "/opt/app-root/lib64/python3.11/site-packages/torch/_ops.py", line 1061, in __call__
      (VllmWorkerProcess pid=1626) ERROR 11-27 02:55:58 multiproc_worker_utils.py:233]     return self_._op(*args, **(kwargs or {}))
      (VllmWorkerProcess pid=1626) ERROR 11-27 02:55:58 multiproc_worker_utils.py:233]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
      (VllmWorkerProcess pid=1626) ERROR 11-27 02:55:58 multiproc_worker_utils.py:233] RuntimeError: CUDA kernel failed : 98
      (VllmWorkerProcess pid=1626) ERROR 11-27 02:55:58 multiproc_worker_utils.py:233] 
      (VllmWorkerProcess pid=1625) ERROR 11-27 02:55:58 multiproc_worker_utils.py:233] Exception in worker VllmWorkerProcess while processing method initialize_cache: CUDA kernel failed : 98, Traceback (most recent call last):
      (VllmWorkerProcess pid=1625) ERROR 11-27 02:55:58 multiproc_worker_utils.py:233]   File "/opt/app-root/lib64/python3.11/site-packages/vllm/executor/multiproc_worker_utils.py", line 226, in _run_worker_process
      (VllmWorkerProcess pid=1625) ERROR 11-27 02:55:58 multiproc_worker_utils.py:233]     output = executor(*args, **kwargs)
      (VllmWorkerProcess pid=1625) ERROR 11-27 02:55:58 multiproc_worker_utils.py:233]              ^^^^^^^^^^^^^^^^^^^^^^^^^
      (VllmWorkerProcess pid=1625) ERROR 11-27 02:55:58 multiproc_worker_utils.py:233]   File "/opt/app-root/lib64/python3.11/site-packages/vllm/worker/worker.py", line 294, in initialize_cache
      (VllmWorkerProcess pid=1625) ERROR 11-27 02:55:58 multiproc_worker_utils.py:233]     self._warm_up_model()
      (VllmWorkerProcess pid=1625) ERROR 11-27 02:55:58 multiproc_worker_utils.py:233]   File "/opt/app-root/lib64/python3.11/site-packages/vllm/worker/worker.py", line 310, in _warm_up_model
      (VllmWorkerProcess pid=1625) ERROR 11-27 02:55:58 multiproc_worker_utils.py:233]     self.model_runner.capture_model(self.gpu_cache)
      (VllmWorkerProcess pid=1625) ERROR 11-27 02:55:58 multiproc_worker_utils.py:233]   File "/opt/app-root/lib64/python3.11/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
      (VllmWorkerProcess pid=1625) ERROR 11-27 02:55:58 multiproc_worker_utils.py:233]     return func(*args, **kwargs)
      (VllmWorkerProcess pid=1625) ERROR 11-27 02:55:58 multiproc_worker_utils.py:233]            ^^^^^^^^^^^^^^^^^^^^^
      (VllmWorkerProcess pid=1625) ERROR 11-27 02:55:58 multiproc_worker_utils.py:233]   File "/opt/app-root/lib64/python3.11/site-packages/vllm/worker/model_runner.py", line 1448, in capture_model
      (VllmWorkerProcess pid=1625) ERROR 11-27 02:55:58 multiproc_worker_utils.py:233]     graph_runner.capture(**capture_inputs)
      (VllmWorkerProcess pid=1625) ERROR 11-27 02:55:58 multiproc_worker_utils.py:233]   File "/opt/app-root/lib64/python3.11/site-packages/vllm/worker/model_runner.py", line 1711, in capture
      (VllmWorkerProcess pid=1625) ERROR 11-27 02:55:58 multiproc_worker_utils.py:233]     self.model(
      (VllmWorkerProcess pid=1625) ERROR 11-27 02:55:58 multiproc_worker_utils.py:233]   File "/opt/app-root/lib64/python3.11/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
      (VllmWorkerProcess pid=1625) ERROR 11-27 02:55:58 multiproc_worker_utils.py:233]     return self._call_impl(*args, **kwargs)
      (VllmWorkerProcess pid=1625) ERROR 11-27 02:55:58 multiproc_worker_utils.py:233]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
      (VllmWorkerProcess pid=1625) ERROR 11-27 02:55:58 multiproc_worker_utils.py:233]   File "/opt/app-root/lib64/python3.11/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
      (VllmWorkerProcess pid=1625) ERROR 11-27 02:55:58 multiproc_worker_utils.py:233]     return forward_call(*args, **kwargs)
      (VllmWorkerProcess pid=1625) ERROR 11-27 02:55:58 multiproc_worker_utils.py:233]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
      (VllmWorkerProcess pid=1625) ERROR 11-27 02:55:58 multiproc_worker_utils.py:233]   File "/opt/app-root/lib64/python3.11/site-packages/vllm/model_executor/models/granite.py", line 422, in forward
      (VllmWorkerProcess pid=1625) ERROR 11-27 02:55:58 multiproc_worker_utils.py:233]     model_output = self.model(input_ids, positions, kv_caches,
      (VllmWorkerProcess pid=1625) ERROR 11-27 02:55:58 multiproc_worker_utils.py:233]                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
      (VllmWorkerProcess pid=1625) ERROR 11-27 02:55:58 multiproc_worker_utils.py:233]   File "/opt/app-root/lib64/python3.11/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
      (VllmWorkerProcess pid=1625) ERROR 11-27 02:55:58 multiproc_worker_utils.py:233]     return self._call_impl(*args, **kwargs)
      (VllmWorkerProcess pid=1625) ERROR 11-27 02:55:58 multiproc_worker_utils.py:233]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
      (VllmWorkerProcess pid=1625) ERROR 11-27 02:55:58 multiproc_worker_utils.py:233]   File "/opt/app-root/lib64/python3.11/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
      (VllmWorkerProcess pid=1625) ERROR 11-27 02:55:58 multiproc_worker_utils.py:233]     return forward_call(*args, **kwargs)
      (VllmWorkerProcess pid=1625) ERROR 11-27 02:55:58 multiproc_worker_utils.py:233]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
      (VllmWorkerProcess pid=1625) ERROR 11-27 02:55:58 multiproc_worker_utils.py:233]   File "/opt/app-root/lib64/python3.11/site-packages/vllm/model_executor/models/granite.py", line 323, in forward
      (VllmWorkerProcess pid=1625) ERROR 11-27 02:55:58 multiproc_worker_utils.py:233]     hidden_states = layer(
      (VllmWorkerProcess pid=1625) ERROR 11-27 02:55:58 multiproc_worker_utils.py:233]                     ^^^^^^
      (VllmWorkerProcess pid=1625) ERROR 11-27 02:55:58 multiproc_worker_utils.py:233]   File "/opt/app-root/lib64/python3.11/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
      (VllmWorkerProcess pid=1625) ERROR 11-27 02:55:58 multiproc_worker_utils.py:233]     return self._call_impl(*args, **kwargs)
      (VllmWorkerProcess pid=1625) ERROR 11-27 02:55:58 multiproc_worker_utils.py:233]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
      (VllmWorkerProcess pid=1625) ERROR 11-27 02:55:58 multiproc_worker_utils.py:233]   File "/opt/app-root/lib64/python3.11/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
      (VllmWorkerProcess pid=1625) ERROR 11-27 02:55:58 multiproc_worker_utils.py:233]     return forward_call(*args, **kwargs)
      (VllmWorkerProcess pid=1625) ERROR 11-27 02:55:58 multiproc_worker_utils.py:233]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
      (VllmWorkerProcess pid=1625) ERROR 11-27 02:55:58 multiproc_worker_utils.py:233]   File "/opt/app-root/lib64/python3.11/site-packages/vllm/model_executor/models/granite.py", line 242, in forward
      (VllmWorkerProcess pid=1625) ERROR 11-27 02:55:58 multiproc_worker_utils.py:233]     hidden_states = self.self_attn(
      (VllmWorkerProcess pid=1625) ERROR 11-27 02:55:58 multiproc_worker_utils.py:233]                     ^^^^^^^^^^^^^^^
      (VllmWorkerProcess pid=1625) ERROR 11-27 02:55:58 multiproc_worker_utils.py:233]   File "/opt/app-root/lib64/python3.11/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
      (VllmWorkerProcess pid=1625) ERROR 11-27 02:55:58 multiproc_worker_utils.py:233]     return self._call_impl(*args, **kwargs)
      (VllmWorkerProcess pid=1625) ERROR 11-27 02:55:58 multiproc_worker_utils.py:233]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
      (VllmWorkerProcess pid=1625) ERROR 11-27 02:55:58 multiproc_worker_utils.py:233]   File "/opt/app-root/lib64/python3.11/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
      (VllmWorkerProcess pid=1625) ERROR 11-27 02:55:58 multiproc_worker_utils.py:233]     return forward_call(*args, **kwargs)
      (VllmWorkerProcess pid=1625) ERROR 11-27 02:55:58 multiproc_worker_utils.py:233]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
      (VllmWorkerProcess pid=1625) ERROR 11-27 02:55:58 multiproc_worker_utils.py:233]   File "/opt/app-root/lib64/python3.11/site-packages/vllm/model_executor/models/granite.py", line 172, in forward
      (VllmWorkerProcess pid=1625) ERROR 11-27 02:55:58 multiproc_worker_utils.py:233]     qkv, _ = self.qkv_proj(hidden_states)
      (VllmWorkerProcess pid=1625) ERROR 11-27 02:55:58 multiproc_worker_utils.py:233]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
      (VllmWorkerProcess pid=1625) ERROR 11-27 02:55:58 multiproc_worker_utils.py:233]   File "/opt/app-root/lib64/python3.11/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
      (VllmWorkerProcess pid=1625) ERROR 11-27 02:55:58 multiproc_worker_utils.py:233]     return self._call_impl(*args, **kwargs)
      (VllmWorkerProcess pid=1625) ERROR 11-27 02:55:58 multiproc_worker_utils.py:233]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
      (VllmWorkerProcess pid=1625) ERROR 11-27 02:55:58 multiproc_worker_utils.py:233]   File "/opt/app-root/lib64/python3.11/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
      (VllmWorkerProcess pid=1625) ERROR 11-27 02:55:58 multiproc_worker_utils.py:233]     return forward_call(*args, **kwargs)
      (VllmWorkerProcess pid=1625) ERROR 11-27 02:55:58 multiproc_worker_utils.py:233]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
      (VllmWorkerProcess pid=1625) ERROR 11-27 02:55:58 multiproc_worker_utils.py:233]   File "/opt/app-root/lib64/python3.11/site-packages/vllm/model_executor/layers/linear.py", line 366, in forward
      (VllmWorkerProcess pid=1625) ERROR 11-27 02:55:58 multiproc_worker_utils.py:233]     output_parallel = self.quant_method.apply(self, input_, bias)
      (VllmWorkerProcess pid=1625) ERROR 11-27 02:55:58 multiproc_worker_utils.py:233]                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
      (VllmWorkerProcess pid=1625) ERROR 11-27 02:55:58 multiproc_worker_utils.py:233]   File "/opt/app-root/lib64/python3.11/site-packages/vllm/model_executor/layers/linear.py", line 134, in apply
      (VllmWorkerProcess pid=1625) ERROR 11-27 02:55:58 multiproc_worker_utils.py:233]     return tgemm.mm(x, layer.weight, bias)
      (VllmWorkerProcess pid=1625) ERROR 11-27 02:55:58 multiproc_worker_utils.py:233]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
      (VllmWorkerProcess pid=1625) ERROR 11-27 02:55:58 multiproc_worker_utils.py:233]   File "/opt/app-root/lib64/python3.11/site-packages/vllm/model_executor/layers/tuned_gemm.py", line 105, in mm
      (VllmWorkerProcess pid=1625) ERROR 11-27 02:55:58 multiproc_worker_utils.py:233]     out = self.apply_skinny(m, n, k, inp_view, weights)
      (VllmWorkerProcess pid=1625) ERROR 11-27 02:55:58 multiproc_worker_utils.py:233]           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
      (VllmWorkerProcess pid=1625) ERROR 11-27 02:55:58 multiproc_worker_utils.py:233]   File "/opt/app-root/lib64/python3.11/site-packages/vllm/model_executor/layers/tuned_gemm.py", line 70, in apply_skinny
      (VllmWorkerProcess pid=1625) ERROR 11-27 02:55:58 multiproc_worker_utils.py:233]     ops.wvSpltK(weights, inp_view, out, n, self.cu_count)
      (VllmWorkerProcess pid=1625) ERROR 11-27 02:55:58 multiproc_worker_utils.py:233]   File "/opt/app-root/lib64/python3.11/site-packages/vllm/_custom_ops.py", line 38, in wrapper
      (VllmWorkerProcess pid=1625) ERROR 11-27 02:55:58 multiproc_worker_utils.py:233]     return fn(*args, **kwargs)
      (VllmWorkerProcess pid=1625) ERROR 11-27 02:55:58 multiproc_worker_utils.py:233]            ^^^^^^^^^^^^^^^^^^^
      (VllmWorkerProcess pid=1625) ERROR 11-27 02:55:58 multiproc_worker_utils.py:233]   File "/opt/app-root/lib64/python3.11/site-packages/vllm/_custom_ops.py", line 973, in wvSpltK
      (VllmWorkerProcess pid=1625) ERROR 11-27 02:55:58 multiproc_worker_utils.py:233]     torch.ops._rocm_C.wvSpltK(a, b, out, N, cu_count)
      (VllmWorkerProcess pid=1625) ERROR 11-27 02:55:58 multiproc_worker_utils.py:233]   File "/opt/app-root/lib64/python3.11/site-packages/torch/_ops.py", line 1061, in __call__
      (VllmWorkerProcess pid=1625) ERROR 11-27 02:55:58 multiproc_worker_utils.py:233]     return self_._op(*args, **(kwargs or {}))
      (VllmWorkerProcess pid=1625) ERROR 11-27 02:55:58 multiproc_worker_utils.py:233]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
      (VllmWorkerProcess pid=1625) ERROR 11-27 02:55:58 multiproc_worker_utils.py:233] RuntimeError: CUDA kernel failed : 98
      (VllmWorkerProcess pid=1625) ERROR 11-27 02:55:58 multiproc_worker_utils.py:233] 
      ERROR 11-27 02:55:58 multiproc_worker_utils.py:120] Worker VllmWorkerProcess pid 1624 died, exit code: -15
      INFO 11-27 02:55:58 multiproc_worker_utils.py:124] Killing local vLLM worker processes
      Process SpawnProcess-1:
      Traceback (most recent call last):
        File "/usr/lib64/python3.11/multiprocessing/process.py", line 314, in _bootstrap
          self.run()
        File "/usr/lib64/python3.11/multiprocessing/process.py", line 108, in run
          self._target(*self._args, **self._kwargs)
        File "/opt/app-root/lib64/python3.11/site-packages/vllm/engine/multiprocessing/engine.py", line 388, in run_mp_engine
          engine = MQLLMEngine.from_engine_args(engine_args=engine_args,
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
        File "/opt/app-root/lib64/python3.11/site-packages/vllm/engine/multiprocessing/engine.py", line 138, in from_engine_args
          return cls(
                 ^^^^
        File "/opt/app-root/lib64/python3.11/site-packages/vllm/engine/multiprocessing/engine.py", line 78, in __init__
          self.engine = LLMEngine(*args,
                        ^^^^^^^^^^^^^^^^
        File "/opt/app-root/lib64/python3.11/site-packages/vllm/engine/llm_engine.py", line 339, in __init__
          self._initialize_kv_caches()
        File "/opt/app-root/lib64/python3.11/site-packages/vllm/engine/llm_engine.py", line 487, in _initialize_kv_caches
          self.model_executor.initialize_cache(num_gpu_blocks, num_cpu_blocks)
        File "/opt/app-root/lib64/python3.11/site-packages/vllm/executor/distributed_gpu_executor.py", line 63, in initialize_cache
          self._run_workers("initialize_cache",
        File "/opt/app-root/lib64/python3.11/site-packages/vllm/executor/multiproc_gpu_executor.py", line 185, in _run_workers
          driver_worker_output = driver_worker_method(*args, **kwargs)
                                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
        File "/opt/app-root/lib64/python3.11/site-packages/vllm/worker/worker.py", line 294, in initialize_cache
          self._warm_up_model()
        File "/opt/app-root/lib64/python3.11/site-packages/vllm/worker/worker.py", line 310, in _warm_up_model
          self.model_runner.capture_model(self.gpu_cache)
        File "/opt/app-root/lib64/python3.11/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
          return func(*args, **kwargs)
                 ^^^^^^^^^^^^^^^^^^^^^
        File "/opt/app-root/lib64/python3.11/site-packages/vllm/worker/model_runner.py", line 1448, in capture_model
          graph_runner.capture(**capture_inputs)
        File "/opt/app-root/lib64/python3.11/site-packages/vllm/worker/model_runner.py", line 1711, in capture
          self.model(
        File "/opt/app-root/lib64/python3.11/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
          return self._call_impl(*args, **kwargs)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
        File "/opt/app-root/lib64/python3.11/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
          return forward_call(*args, **kwargs)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
        File "/opt/app-root/lib64/python3.11/site-packages/vllm/model_executor/models/granite.py", line 422, in forward
          model_output = self.model(input_ids, positions, kv_caches,
                         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
        File "/opt/app-root/lib64/python3.11/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
          return self._call_impl(*args, **kwargs)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
        File "/opt/app-root/lib64/python3.11/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
          return forward_call(*args, **kwargs)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
        File "/opt/app-root/lib64/python3.11/site-packages/vllm/model_executor/models/granite.py", line 323, in forward
          hidden_states = layer(
                          ^^^^^^
        File "/opt/app-root/lib64/python3.11/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
          return self._call_impl(*args, **kwargs)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
        File "/opt/app-root/lib64/python3.11/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
          return forward_call(*args, **kwargs)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
        File "/opt/app-root/lib64/python3.11/site-packages/vllm/model_executor/models/granite.py", line 242, in forward
          hidden_states = self.self_attn(
                          ^^^^^^^^^^^^^^^
        File "/opt/app-root/lib64/python3.11/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
          return self._call_impl(*args, **kwargs)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
        File "/opt/app-root/lib64/python3.11/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
          return forward_call(*args, **kwargs)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
        File "/opt/app-root/lib64/python3.11/site-packages/vllm/model_executor/models/granite.py", line 172, in forward
          qkv, _ = self.qkv_proj(hidden_states)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
        File "/opt/app-root/lib64/python3.11/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
          return self._call_impl(*args, **kwargs)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
        File "/opt/app-root/lib64/python3.11/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
          return forward_call(*args, **kwargs)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
        File "/opt/app-root/lib64/python3.11/site-packages/vllm/model_executor/layers/linear.py", line 366, in forward
          output_parallel = self.quant_method.apply(self, input_, bias)
                            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
        File "/opt/app-root/lib64/python3.11/site-packages/vllm/model_executor/layers/linear.py", line 134, in apply
          return tgemm.mm(x, layer.weight, bias)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
        File "/opt/app-root/lib64/python3.11/site-packages/vllm/model_executor/layers/tuned_gemm.py", line 105, in mm
          out = self.apply_skinny(m, n, k, inp_view, weights)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
        File "/opt/app-root/lib64/python3.11/site-packages/vllm/model_executor/layers/tuned_gemm.py", line 70, in apply_skinny
          ops.wvSpltK(weights, inp_view, out, n, self.cu_count)
        File "/opt/app-root/lib64/python3.11/site-packages/vllm/_custom_ops.py", line 38, in wrapper
          return fn(*args, **kwargs)
                 ^^^^^^^^^^^^^^^^^^^
        File "/opt/app-root/lib64/python3.11/site-packages/vllm/_custom_ops.py", line 973, in wvSpltK
          torch.ops._rocm_C.wvSpltK(a, b, out, N, cu_count)
        File "/opt/app-root/lib64/python3.11/site-packages/torch/_ops.py", line 1061, in __call__
          return self_._op(*args, **(kwargs or {}))
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
      RuntimeError: CUDA kernel failed : 98
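
For triage, the two useful facts in a traceback like the one above are the numeric kernel error code and the last vLLM frame reached before it was raised. The following is a minimal sketch of pulling both out of a saved log. The helper name `summarize_failure` and the `KNOWN_CODES` table are hypothetical and not part of vLLM or InstructLab; the only code mapped is 98, using the meaning stated in this issue.

```python
import re

# Meaning of code 98 as stated in this issue; other codes are left unmapped.
KNOWN_CODES = {98: "invalid device function"}

# Matches Python traceback frame lines, with or without the worker log prefix.
FRAME_RE = re.compile(r'File "(?P<path>[^"]+)", line (?P<line>\d+), in (?P<func>\S+)')
ERROR_RE = re.compile(r"RuntimeError: CUDA kernel failed : (?P<code>\d+)")


def summarize_failure(log_text):
    """Return (code, meaning, last_vllm_frame) for the first kernel failure.

    last_vllm_frame is (path, line, function) for the most recent frame whose
    path contains "/vllm/", or None if no such frame preceded the error.
    Returns None when the log contains no kernel failure line.
    """
    last_vllm_frame = None
    for line in log_text.splitlines():
        frame = FRAME_RE.search(line)
        if frame:
            if "/vllm/" in frame["path"]:
                last_vllm_frame = (frame["path"], int(frame["line"]), frame["func"])
            continue
        error = ERROR_RE.search(line)
        if error:
            code = int(error["code"])
            return code, KNOWN_CODES.get(code, "unknown"), last_vllm_frame
    return None
```

Run against the frames shown above, this should point at `wvSpltK` in `vllm/_custom_ops.py`, the wrapper around `torch.ops._rocm_C.wvSpltK`, which is consistent with a kernel that may not have been built for the GPU architecture in use.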
      

       

              fdupont@redhat.com Fabien Dupont
              jgreene@redhat.com Jason Greene
