Bug
Resolution: Unresolved
Critical
RHELAI 1.3 GA
False
False
Release Notes
Known Issue
Approved
Workaround
sudo systemctl stop nvidia-persistenced.service
sudo systemctl start nvidia-fabricmanager.service
sudo systemctl start nvidia-persistenced.service
Analysis
As can be seen in the journal excerpt below, there are two problems:
- NVIDIA Fabric Manager starts before the nvswitch device has finished initializing, so it fails the ConditionDirectoryNotEmpty check on /proc/driver/nvidia-nvswitch/devices
- nvidia-persistenced starts BEFORE Fabric Manager; it should always start after it
The likely fix for (1) is a Requires=/After= device dependency on the nvidia-nvswitch device (nvidia-nvswitch%i), or a later start in the boot sequence; for (2), nvidia-persistenced.service should gain an After= on the Fabric Manager service (see the drop-in sketch after the journal excerpt).
Nov 27 20:37:50 ip-10-0-30-37.us-east-2.compute.internal kernel: nvidia-nvswitch 0000:cf:00.0: enabling device (0000 -> 0002)
Nov 27 20:37:50 ip-10-0-30-37.us-east-2.compute.internal kernel: nvidia-nvswitch 0000:cf:00.0: PCI INT A: no GSI
Nov 27 20:37:51 ip-10-0-30-37.us-east-2.compute.internal systemd[1]: Starting NVIDIA Persistence Daemon...
Nov 27 20:37:51 ip-10-0-30-37.us-east-2.compute.internal systemd[1]: Starting Generate /etc/cdi/nvidia.yaml...
Nov 27 20:37:51 ip-10-0-30-37.us-east-2.compute.internal nvidia-persistenced[3475]: Verbose syslog connection opened
Nov 27 20:37:51 ip-10-0-30-37.us-east-2.compute.internal nvidia-persistenced[3475]: Started (3475)
Nov 27 20:37:51 ip-10-0-30-37.us-east-2.compute.internal nvidia-ctk[3474]: time="2024-11-27T20:37:51Z" level=info msg="Using /usr/lib64/libnvidia-ml.so.550.127.05"
Nov 27 20:37:51 ip-10-0-30-37.us-east-2.compute.internal nvidia-ctk[3474]: time="2024-11-27T20:37:51Z" level=warning msg="Ignoring error in locating libnvidia-sandboxutils.so.1: pattern libnvidia-sandboxutils.so.1 not found\nlibnvidia-sandboxutils.so.1: not found"
Nov 27 20:37:51 ip-10-0-30-37.us-east-2.compute.internal kernel: nvidia-nvswitch0: using MSI
Nov 27 20:37:54 ip-10-0-30-37.us-east-2.compute.internal systemd[1]: NVIDIA fabric manager service was skipped because of an unmet condition check (ConditionDirectoryNotEmpty=/proc/driver/nvidia-nvswitch/devices).
Nov 27 20:37:54 ip-10-0-30-37.us-east-2.compute.internal .tmpfS1Bo0[3703]: Fetching ostree-unverified-registry:registry.redhat.io/rhelai1/bootc-nvidia-rhel9:1.3
Nov 27 20:37:57 ip-10-0-30-37.us-east-2.compute.internal kernel: nvidia-nvswitch: Probing device 0000:d0:00.0, Vendor Id = 0x10de, Device Id = 0x22a3, Class = 0x68000
Nov 27 20:37:57 ip-10-0-30-37.us-east-2.compute.internal kernel: nvidia-nvswitch 0000:d0:00.0: enabling device (0000 -> 0002)
Nov 27 20:37:57 ip-10-0-30-37.us-east-2.compute.internal kernel: nvidia-nvswitch 0000:d0:00.0: PCI INT A: no GSI
Nov 27 20:37:58 ip-10-0-30-37.us-east-2.compute.internal kernel: nvidia-nvswitch1: using MSI
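A minimal sketch of the ordering fix for problem (2), written as a shell snippet that installs a systemd drop-in. The unit names are assumed from the workaround above and should be verified on the host; this is not a confirmed fix for the image.

# Sketch only: order nvidia-persistenced after NVIDIA Fabric Manager (problem 2).
sudo mkdir -p /etc/systemd/system/nvidia-persistenced.service.d
sudo tee /etc/systemd/system/nvidia-persistenced.service.d/10-after-fabricmanager.conf <<'EOF'
[Unit]
# Do not start the persistence daemon until Fabric Manager has been started.
After=nvidia-fabricmanager.service
Wants=nvidia-fabricmanager.service
EOF
sudo systemctl daemon-reload
# Problem (1) (Fabric Manager racing the nvswitch probe) would additionally need a
# Requires=/After= on the nvswitch device unit or a delayed start; the exact device
# unit name is not confirmed here, so it is left out of this sketch.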
To Reproduce
Steps to reproduce the behavior:
- Train a model
- Serve the model - success
- Reboot the node
- Attempt to serve the model again (a rough command sketch follows this list)
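For reference, a rough shell sequence for these steps. The serve invocation is copied from the output below; the train invocation is only a placeholder and omits whatever flags were actually used.

# Steps 1-2: train, then serve the resulting checkpoint (illustrative; training flags omitted)
ilab model train
ilab model serve --model-path /var/home/cloud-user/.local/share/instructlab/phased/phase2/checkpoints/hf_format/samples_29117
# Step 3: reboot the node
sudo systemctl reboot
# Step 4: after the node comes back up, the same serve command fails as shown under "Result"
ilab model serve --model-path /var/home/cloud-user/.local/share/instructlab/phased/phase2/checkpoints/hf_format/samples_29117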
Result:
ilab model serve --model-path /var/home/cloud-user/.local/share/instructlab/phased/phase2/checkpoints/hf_format/samples_29117
INFO 2024-11-26 22:18:04,244 instructlab.model.serve_backend:56: Using model '/var/home/cloud-user/.local/share/instructlab/phased/phase2/checkpoints/hf_format/samples_29117' with -1 gpu-layers and 4096 max context size.
INFO 2024-11-26 22:18:04,244 instructlab.model.serve_backend:88: '--gpus' flag used alongside '--tensor-parallel-size' in the vllm_args section of the config file. Using value of the --gpus flag.
INFO 2024-11-26 22:18:04,246 instructlab.model.backends.vllm:313: vLLM starting up on pid 76 at http://127.0.0.1:8000/v1
INFO 11-26 22:18:10 api_server.py:526] vLLM API server version 0.6.2
INFO 11-26 22:18:10 api_server.py:527] args: Namespace(host='127.0.0.1', port=8000, uvicorn_log_level='info', allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, lora_modules=None, prompt_adapters=None, chat_template='/tmp/tmpbkz643jw', response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, ssl_cert_reqs=0, root_path=None, middleware=[], return_tokens_as_token_ids=False, disable_frontend_multiprocessing=False, enable_auto_tool_choice=False, tool_call_parser=None, model='/var/home/cloud-user/.local/share/instructlab/phased/phase2/checkpoints/hf_format/samples_29117', tokenizer=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=False, download_dir=None, load_format='auto', config_format='auto', dtype='auto', kv_cache_dtype='auto', quantization_param_path=None, max_model_len=None, guided_decoding_backend='outlines', distributed_executor_backend='mp', worker_use_ray=False, pipeline_parallel_size=1, tensor_parallel_size=8, max_parallel_loading_workers=None, ray_workers_use_nsight=False, block_size=16, enable_prefix_caching=False, disable_sliding_window=False, use_v2_block_manager=False, num_lookahead_slots=0, seed=0, swap_space=4, cpu_offload_gb=0, gpu_memory_utilization=0.9, num_gpu_blocks_override=None, max_num_batched_tokens=None, max_num_seqs=256, max_logprobs=20, disable_log_stats=False, quantization=None, rope_scaling=None, rope_theta=None, enforce_eager=False, max_context_len_to_capture=None, max_seq_len_to_capture=8192, disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, limit_mm_per_prompt=None, mm_processor_kwargs=None, enable_lora=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', long_lora_scaling_factors=None, max_cpu_loras=None, fully_sharded_loras=False, enable_prompt_adapter=False, max_prompt_adapters=1, max_prompt_adapter_token=0, device='auto', num_scheduler_steps=1, multi_step_stream_outputs=False, scheduler_delay_factor=0.0, enable_chunked_prefill=None, speculative_model=None, speculative_model_quantization=None, num_speculative_tokens=None, speculative_draft_tensor_parallel_size=None, speculative_max_model_len=None, speculative_disable_by_batch_size=None, ngram_prompt_lookup_max=None, ngram_prompt_lookup_min=None, spec_decoding_acceptance_method='rejection_sampler', typical_acceptance_sampler_posterior_threshold=None, typical_acceptance_sampler_posterior_alpha=None, disable_logprobs_during_spec_decoding=None, model_loader_extra_config=None, ignore_patterns=[], preemption_mode=None, served_model_name=None, qlora_adapter_name_or_path=None, otlp_traces_endpoint=None, collect_detailed_traces=None, disable_async_output_proc=False, override_neuron_config=None, disable_log_requests=False, max_log_len=None, disable_fastapi_docs=False)
INFO 11-26 22:18:10 api_server.py:164] Multiprocessing frontend to use ipc:///tmp/b4da7575-80ea-445b-9f9d-adb7a092d83c for IPC Path.
INFO 11-26 22:18:10 api_server.py:177] Started engine process with PID 80
INFO 11-26 22:18:10 config.py:1652] Downcasting torch.float32 to torch.float16.
INFO 11-26 22:18:14 config.py:1652] Downcasting torch.float32 to torch.float16.
INFO 11-26 22:18:14 llm_engine.py:226] Initializing an LLM engine (v0.6.2) with config: model='/var/home/cloud-user/.local/share/instructlab/phased/phase2/checkpoints/hf_format/samples_29117', speculative_config=None, tokenizer='/var/home/cloud-user/.local/share/instructlab/phased/phase2/checkpoints/hf_format/samples_29117', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=4096, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=8, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=/var/home/cloud-user/.local/share/instructlab/phased/phase2/checkpoints/hf_format/samples_29117, use_v2_block_manager=False, num_scheduler_steps=1, multi_step_stream_outputs=False, enable_prefix_caching=False, use_async_output_proc=True, use_cached_outputs=True, mm_processor_kwargs=None)
WARNING 11-26 22:18:14 multiproc_gpu_executor.py:53] Reducing Torch parallelism from 96 threads to 1 to avoid unnecessary CPU contention. Set OMP_NUM_THREADS in the external environment to tune this value as needed.
INFO 11-26 22:18:14 custom_cache_manager.py:17] Setting Triton cache manager to: vllm.triton_utils.custom_cache_manager:CustomCacheManager
(VllmWorkerProcess pid=151) INFO 11-26 22:18:15 multiproc_worker_utils.py:218] Worker ready; awaiting tasks
(VllmWorkerProcess pid=153) INFO 11-26 22:18:15 multiproc_worker_utils.py:218] Worker ready; awaiting tasks
(VllmWorkerProcess pid=152) INFO 11-26 22:18:15 multiproc_worker_utils.py:218] Worker ready; awaiting tasks
(VllmWorkerProcess pid=148) INFO 11-26 22:18:16 multiproc_worker_utils.py:218] Worker ready; awaiting tasks
(VllmWorkerProcess pid=147) INFO 11-26 22:18:16 multiproc_worker_utils.py:218] Worker ready; awaiting tasks
(VllmWorkerProcess pid=150) INFO 11-26 22:18:16 multiproc_worker_utils.py:218] Worker ready; awaiting tasks
(VllmWorkerProcess pid=149) INFO 11-26 22:18:16 multiproc_worker_utils.py:218] Worker ready; awaiting tasks
(VllmWorkerProcess pid=151) INFO 11-26 22:18:45 multiproc_worker_utils.py:244] Worker exiting
INFO 11-26 22:18:45 multiproc_worker_utils.py:124] Killing local vLLM worker processes
Process SpawnProcess-1:
Traceback (most recent call last):
  File "/usr/lib64/python3.11/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/usr/lib64/python3.11/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/opt/app-root/lib64/python3.11/site-packages/vllm/engine/multiprocessing/engine.py", line 388, in run_mp_engine
    engine = MQLLMEngine.from_engine_args(engine_args=engine_args,
  File "/opt/app-root/lib64/python3.11/site-packages/vllm/engine/multiprocessing/engine.py", line 138, in from_engine_args
    return cls(
  File "/opt/app-root/lib64/python3.11/site-packages/vllm/engine/multiprocessing/engine.py", line 78, in __init__
    self.engine = LLMEngine(*args,
  File "/opt/app-root/lib64/python3.11/site-packages/vllm/engine/llm_engine.py", line 325, in __init__
    self.model_executor = executor_class(
  File "/opt/app-root/lib64/python3.11/site-packages/vllm/executor/distributed_gpu_executor.py", line 26, in __init__
    super().__init__(*args, **kwargs)
  File "/opt/app-root/lib64/python3.11/site-packages/vllm/executor/executor_base.py", line 47, in __init__
    self._init_executor()
  File "/opt/app-root/lib64/python3.11/site-packages/vllm/executor/multiproc_gpu_executor.py", line 110, in _init_executor
    self._run_workers("init_device")
  File "/opt/app-root/lib64/python3.11/site-packages/vllm/executor/multiproc_gpu_executor.py", line 185, in _run_workers
    driver_worker_output = driver_worker_method(*args, **kwargs)
  File "/opt/app-root/lib64/python3.11/site-packages/vllm/worker/worker.py", line 166, in init_device
    torch.cuda.set_device(self.device)
  File "/opt/app-root/lib64/python3.11/site-packages/torch/cuda/__init__.py", line 420, in set_device
    torch._C._cuda_setDevice(device)
  File "/opt/app-root/lib64/python3.11/site-packages/torch/cuda/__init__.py", line 314, in _lazy_init
    torch._C._cuda_init()
RuntimeError: Unexpected error from cudaGetDeviceCount(). Did you run some cuda functions before calling NumCudaDevices() that might have already set an error? Error 802: system not yet initialized
Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "/opt/app-root/lib64/python3.11/site-packages/vllm/entrypoints/openai/api_server.py", line 571, in <module>
    uvloop.run(run_server(args))
  File "/opt/app-root/lib64/python3.11/site-packages/uvloop/__init__.py", line 105, in run
    return runner.run(wrapper())
  File "/usr/lib64/python3.11/asyncio/runners.py", line 118, in run
    return self._loop.run_until_complete(task)
  File "uvloop/loop.pyx", line 1518, in uvloop.loop.Loop.run_until_complete
  File "/opt/app-root/lib64/python3.11/site-packages/uvloop/__init__.py", line 61, in wrapper
    return await main
  File "/opt/app-root/lib64/python3.11/site-packages/vllm/entrypoints/openai/api_server.py", line 538, in run_server
    async with build_async_engine_client(args) as engine_client:
  File "/usr/lib64/python3.11/contextlib.py", line 210, in __aenter__
    return await anext(self.gen)
  File "/opt/app-root/lib64/python3.11/site-packages/vllm/entrypoints/openai/api_server.py", line 105, in build_async_engine_client
    async with build_async_engine_client_from_engine_args(
  File "/usr/lib64/python3.11/contextlib.py", line 210, in __aenter__
    return await anext(self.gen)
  File "/opt/app-root/lib64/python3.11/site-packages/vllm/entrypoints/openai/api_server.py", line 192, in build_async_engine_client_from_engine_args
    raise RuntimeError(
RuntimeError: Engine process failed to start
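The trailing Error 802 ("system not yet initialized") is consistent with Fabric Manager not running on an NVSwitch system. A quick post-reboot check, assuming the unit names from the workaround above:

# Was Fabric Manager skipped or failed on this boot?
sudo systemctl status nvidia-fabricmanager.service nvidia-persistenced.service
sudo journalctl -b -u nvidia-fabricmanager.service
# Directory used by the unit's ConditionDirectoryNotEmpty check; empty means the nvswitch devices were not ready
ls /proc/driver/nvidia-nvswitch/devices
# Basic GPU sanity check
nvidia-smi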
Expected behavior
- The model should be served without issues, just as before the reboot
- Hardware Specs: AWS p5.48xlarge
- Python Version: Python 3.9.18
- InstructLab Version: 0.21.0
- OS Version:
NAME="Red Hat Enterprise Linux"
VERSION="9.20241104.0.4 (Plow)"
ID="rhel"
ID_LIKE="fedora"
VERSION_ID="9.4"
PLATFORM_ID="platform:el9"
PRETTY_NAME="Red Hat Enterprise Linux 9.20241104.0.4 (Plow)"
ANSI_COLOR="0;31"
LOGO="fedora-logo-icon"
CPE_NAME="cpe:/o:redhat:enterprise_linux:9::baseos"
HOME_URL="https://www.redhat.com/"
DOCUMENTATION_URL="https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/9"
BUG_REPORT_URL="https://issues.redhat.com/"
REDHAT_BUGZILLA_PRODUCT="Red Hat Enterprise Linux 9"
REDHAT_BUGZILLA_PRODUCT_VERSION=9.4
REDHAT_SUPPORT_PRODUCT="Red Hat Enterprise Linux"
REDHAT_SUPPORT_PRODUCT_VERSION="9.4"
OSTREE_VERSION='9.20241104.0'
VARIANT="RHEL AI"
VARIANT_ID=rhel_ai
RHEL_AI_VERSION_ID='1.3.0'