Red Hat Enterprise Linux AI
RHELAI-2412

Fabric manager does not always start, resulting in CUDA failures

    • Release Notes: Known Issue (Approved)

      Workaround

      sudo systemctl stop nvidia-persistenced.service
      sudo systemctl start nvidia-fabricmanager.service
      sudo systemctl start nvidia-persistenced.service
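      A quick way to confirm the workaround took effect, sketched here with standard systemd tooling (service names as above):

      # Verify both services are running after applying the workaround
      systemctl is-active nvidia-fabricmanager.service      # expect "active"
      systemctl is-active nvidia-persistenced.service       # expect "active"
      journalctl -u nvidia-fabricmanager.service --no-pager | tail -n 20   # look for a successful start, not the unmet-condition skip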

       

      Analysis

      As can be seen below, there are two problems:

      1. The NVIDIA fabric manager starts before the nvswitch device has finished initializing, so it fails the ConditionDirectoryNotEmpty check on /proc/driver/nvidia-nvswitch/devices.
      2. The NVIDIA persistence daemon starts BEFORE the fabric manager; it should always start after it.

      The likely solution for 1 is a Requires= plus After= device dependency on the nvidia-nvswitch device unit (nvidia nvswitch%i), or a later start in the boot sequence; for 2, nvidia-persistenced.service should gain an After= on nvidia-fabricmanager.service.
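      A minimal sketch of what the fix for 2 could look like as a systemd drop-in, assuming the stock unit names nvidia-persistenced.service and nvidia-fabricmanager.service; the shipped fix may differ, and the device-unit dependency for 1 is not sketched because the exact device unit name depends on how udev tags the nvswitch devices:

      # Sketch only: order the persistence daemon after the fabric manager via a drop-in
      sudo mkdir -p /etc/systemd/system/nvidia-persistenced.service.d
      printf '[Unit]\nAfter=nvidia-fabricmanager.service\nWants=nvidia-fabricmanager.service\n' | sudo tee /etc/systemd/system/nvidia-persistenced.service.d/10-after-fabricmanager.conf
      sudo systemctl daemon-reload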

       

       
      Nov 27 20:37:50 ip-10-0-30-37.us-east-2.compute.internal kernel: nvidia-nvswitch 0000:cf:00.0: enabling device (0000 -> 0002)
      Nov 27 20:37:50 ip-10-0-30-37.us-east-2.compute.internal kernel: nvidia-nvswitch 0000:cf:00.0: PCI INT A: no GSI
      Nov 27 20:37:51 ip-10-0-30-37.us-east-2.compute.internal systemd[1]: Starting NVIDIA Persistence Daemon...
      Nov 27 20:37:51 ip-10-0-30-37.us-east-2.compute.internal systemd[1]: Starting Generate /etc/cdi/nvidia.yaml...
      Nov 27 20:37:51 ip-10-0-30-37.us-east-2.compute.internal nvidia-persistenced[3475]: Verbose syslog connection opened
      Nov 27 20:37:51 ip-10-0-30-37.us-east-2.compute.internal nvidia-persistenced[3475]: Started (3475)
      Nov 27 20:37:51 ip-10-0-30-37.us-east-2.compute.internal nvidia-ctk[3474]: time="2024-11-27T20:37:51Z" level=info msg="Using /usr/lib64/libnvidia-ml.so.550.127.05"
      Nov 27 20:37:51 ip-10-0-30-37.us-east-2.compute.internal nvidia-ctk[3474]: time="2024-11-27T20:37:51Z" level=warning msg="Ignoring error in locating libnvidia-sandboxutils.so.1: pattern libnvidia-sandboxutils.so.1 not found\nlibnvidia-sandboxutils.so.1: not found"
      Nov 27 20:37:51 ip-10-0-30-37.us-east-2.compute.internal kernel: nvidia-nvswitch0: using MSI
      Nov 27 20:37:54 ip-10-0-30-37.us-east-2.compute.internal systemd[1]: NVIDIA fabric manager service was skipped because of an unmet condition check (ConditionDirectoryNotEmpty=/proc/driver/nvidia-nvswitch/devices).
      Nov 27 20:37:54 ip-10-0-30-37.us-east-2.compute.internal .tmpfS1Bo0[3703]: Fetching ostree-unverified-registry:registry.redhat.io/rhelai1/bootc-nvidia-rhel9:1.3
      Nov 27 20:37:57 ip-10-0-30-37.us-east-2.compute.internal kernel: nvidia-nvswitch: Probing device 0000:d0:00.0, Vendor Id = 0x10de, Device Id = 0x22a3, Class = 0x68000
      Nov 27 20:37:57 ip-10-0-30-37.us-east-2.compute.internal kernel: nvidia-nvswitch 0000:d0:00.0: enabling device (0000 -> 0002)
      Nov 27 20:37:57 ip-10-0-30-37.us-east-2.compute.internal kernel: nvidia-nvswitch 0000:d0:00.0: PCI INT A: no GSI
      Nov 27 20:37:58 ip-10-0-30-37.us-east-2.compute.internal kernel: nvidia-nvswitch1: using MSI

       

      To Reproduce

      Steps to reproduce the behavior:

      1. Train a model
      2. Serve the model (succeeds)
      3. Reboot the node
      4. Attempt to serve the model again
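      A hedged sketch of the command sequence behind these steps; the checkpoint path and any training flags are placeholders rather than the exact ones used:

      # Sketch of the reproduction flow (placeholder paths/flags)
      ilab model train                                        # produces checkpoints under .../phased/phase2/checkpoints/hf_format/
      ilab model serve --model-path <hf_format-checkpoint>    # succeeds before reboot
      sudo systemctl reboot
      ilab model serve --model-path <hf_format-checkpoint>    # fails with CUDA "Error 802: system not yet initialized"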

      Result:

      ilab model serve --model-path /var/home/cloud-user/.local/share/instructlab/phased/phase2/checkpoints/hf_format/samples_29117
      INFO 2024-11-26 22:18:04,244 instructlab.model.serve_backend:56: Using model '/var/home/cloud-user/.local/share/instructlab/phased/phase2/checkpoints/hf_format/samples_29117' with -1 gpu-lay
      ers and 4096 max context size.                                                                                                                                                               
      INFO 2024-11-26 22:18:04,244 instructlab.model.serve_backend:88: '--gpus' flag used alongside '--tensor-parallel-size' in the vllm_args section of the config file. Using value of the --gpus
      flag.                                                                
      INFO 2024-11-26 22:18:04,246 instructlab.model.backends.vllm:313: vLLM starting up on pid 76 at http://127.0.0.1:8000/v1
      INFO 11-26 22:18:10 api_server.py:526] vLLM API server version 0.6.2
      INFO 11-26 22:18:10 api_server.py:527] args: Namespace(host='127.0.0.1', port=8000, uvicorn_log_level='info', allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_h
      eaders=['*'], api_key=None, lora_modules=None, prompt_adapters=None, chat_template='/tmp/tmpbkz643jw', response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, ssl_
      cert_reqs=0, root_path=None, middleware=[], return_tokens_as_token_ids=False, disable_frontend_multiprocessing=False, enable_auto_tool_choice=False, tool_call_parser=None, model='/var/home/c
      loud-user/.local/share/instructlab/phased/phase2/checkpoints/hf_format/samples_29117', tokenizer=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None,
      tokenizer_mode='auto', trust_remote_code=False, download_dir=None, load_format='auto', config_format='auto', dtype='auto', kv_cache_dtype='auto', quantization_param_path=None, max_model_len=
      None, guided_decoding_backend='outlines', distributed_executor_backend='mp', worker_use_ray=False, pipeline_parallel_size=1, tensor_parallel_size=8, max_parallel_loading_workers=None, ray_wo
      rkers_use_nsight=False, block_size=16, enable_prefix_caching=False, disable_sliding_window=False, use_v2_block_manager=False, num_lookahead_slots=0, seed=0, swap_space=4, cpu_offload_gb=0, g
      pu_memory_utilization=0.9, num_gpu_blocks_override=None, max_num_batched_tokens=None, max_num_seqs=256, max_logprobs=20, disable_log_stats=False, quantization=None, rope_scaling=None, rope_t
      heta=None, enforce_eager=False, max_context_len_to_capture=None, max_seq_len_to_capture=8192, disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_poo
      l_extra_config=None, limit_mm_per_prompt=None, mm_processor_kwargs=None, enable_lora=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', long_lora_scaling_fac
      tors=None, max_cpu_loras=None, fully_sharded_loras=False, enable_prompt_adapter=False, max_prompt_adapters=1, max_prompt_adapter_token=0, device='auto', num_scheduler_steps=1, multi_step_str
      eam_outputs=False, scheduler_delay_factor=0.0, enable_chunked_prefill=None, speculative_model=None, speculative_model_quantization=None, num_speculative_tokens=None, speculative_draft_tensor
      _parallel_size=None, speculative_max_model_len=None, speculative_disable_by_batch_size=None, ngram_prompt_lookup_max=None, ngram_prompt_lookup_min=None, spec_decoding_acceptance_method='reje
      ction_sampler', typical_acceptance_sampler_posterior_threshold=None, typical_acceptance_sampler_posterior_alpha=None, disable_logprobs_during_spec_decoding=None, model_loader_extra_config=No
      ne, ignore_patterns=[], preemption_mode=None, served_model_name=None, qlora_adapter_name_or_path=None, otlp_traces_endpoint=None, collect_detailed_traces=None, disable_async_output_proc=Fals
      e, override_neuron_config=None, disable_log_requests=False, max_log_len=None, disable_fastapi_docs=False)
      INFO 11-26 22:18:10 api_server.py:164] Multiprocessing frontend to use ipc:///tmp/b4da7575-80ea-445b-9f9d-adb7a092d83c for IPC Path.
      INFO 11-26 22:18:10 api_server.py:177] Started engine process with PID 80
      INFO 11-26 22:18:10 config.py:1652] Downcasting torch.float32 to torch.float16.
      INFO 11-26 22:18:14 config.py:1652] Downcasting torch.float32 to torch.float16.
      INFO 11-26 22:18:14 llm_engine.py:226] Initializing an LLM engine (v0.6.2) with config: model='/var/home/cloud-user/.local/share/instructlab/phased/phase2/checkpoints/hf_format/samples_29117
      ', speculative_config=None, tokenizer='/var/home/cloud-user/.local/share/instructlab/phased/phase2/checkpoints/hf_format/samples_29117', skip_tokenizer_init=False, tokenizer_mode=auto, revis
      ion=None, override_neuron_config=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=4096, download_dir=None, load_fo
      rmat=LoadFormat.AUTO, tensor_parallel_size=8, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=
      None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=F
      alse, collect_model_execute_time=False), seed=0, served_model_name=/var/home/cloud-user/.local/share/instructlab/phased/phase2/checkpoints/hf_format/samples_29117, use_v2_block_manager=False
      , num_scheduler_steps=1, multi_step_stream_outputs=False, enable_prefix_caching=False, use_async_output_proc=True, use_cached_outputs=True, mm_processor_kwargs=None)
      WARNING 11-26 22:18:14 multiproc_gpu_executor.py:53] Reducing Torch parallelism from 96 threads to 1 to avoid unnecessary CPU contention. Set OMP_NUM_THREADS in the external environment to t
      une this value as needed.
      INFO 11-26 22:18:14 custom_cache_manager.py:17] Setting Triton cache manager to: vllm.triton_utils.custom_cache_manager:CustomCacheManager
      (VllmWorkerProcess pid=151) INFO 11-26 22:18:15 multiproc_worker_utils.py:218] Worker ready; awaiting tasks
      (VllmWorkerProcess pid=153) INFO 11-26 22:18:15 multiproc_worker_utils.py:218] Worker ready; awaiting tasks
      (VllmWorkerProcess pid=152) INFO 11-26 22:18:15 multiproc_worker_utils.py:218] Worker ready; awaiting tasks
      (VllmWorkerProcess pid=148) INFO 11-26 22:18:16 multiproc_worker_utils.py:218] Worker ready; awaiting tasks
      (VllmWorkerProcess pid=147) INFO 11-26 22:18:16 multiproc_worker_utils.py:218] Worker ready; awaiting tasks
      (VllmWorkerProcess pid=150) INFO 11-26 22:18:16 multiproc_worker_utils.py:218] Worker ready; awaiting tasks
      (VllmWorkerProcess pid=149) INFO 11-26 22:18:16 multiproc_worker_utils.py:218] Worker ready; awaiting tasks
      (VllmWorkerProcess pid=151) INFO 11-26 22:18:45 multiproc_worker_utils.py:244] Worker exiting
      INFO 11-26 22:18:45 multiproc_worker_utils.py:124] Killing local vLLM worker processes Process SpawnProcess-1:
      Traceback (most recent call last):
        File "/usr/lib64/python3.11/multiprocessing/process.py", line 314, in _bootstrap
          self.run()
        File "/usr/lib64/python3.11/multiprocessing/process.py", line 108, in run
          self._target(*self._args, **self._kwargs)
        File "/opt/app-root/lib64/python3.11/site-packages/vllm/engine/multiprocessing/engine.py", line 388, in run_mp_engine
          engine = MQLLMEngine.from_engine_args(engine_args=engine_args,
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
        File "/opt/app-root/lib64/python3.11/site-packages/vllm/engine/multiprocessing/engine.py", line 138, in from_engine_args
          return cls(
                 ^^^^
        File "/opt/app-root/lib64/python3.11/site-packages/vllm/engine/multiprocessing/engine.py", line 78, in __init__
          self.engine = LLMEngine(*args,
                        ^^^^^^^^^^^^^^^^
        File "/opt/app-root/lib64/python3.11/site-packages/vllm/engine/llm_engine.py", line 325, in __init__
          self.model_executor = executor_class(
                                ^^^^^^^^^^^^^^^
        File "/opt/app-root/lib64/python3.11/site-packages/vllm/executor/distributed_gpu_executor.py", line 26, in __init__
          super().__init__(*args, **kwargs)
        File "/opt/app-root/lib64/python3.11/site-packages/vllm/executor/executor_base.py", line 47, in __init__
          self._init_executor()
        File "/opt/app-root/lib64/python3.11/site-packages/vllm/executor/multiproc_gpu_executor.py", line 110, in _init_executor
          self._run_workers("init_device")
        File "/opt/app-root/lib64/python3.11/site-packages/vllm/executor/multiproc_gpu_executor.py", line 185, in _run_workers
          driver_worker_output = driver_worker_method(*args, **kwargs)
                                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
        File "/opt/app-root/lib64/python3.11/site-packages/vllm/worker/worker.py", line 166, in init_device
          torch.cuda.set_device(self.device)
        File "/opt/app-root/lib64/python3.11/site-packages/torch/cuda/__init__.py", line 420, in set_device
          torch._C._cuda_setDevice(device)
        File "/opt/app-root/lib64/python3.11/site-packages/torch/cuda/__init__.py", line 314, in _lazy_init
          torch._C._cuda_init()
      RuntimeError: Unexpected error from cudaGetDeviceCount(). Did you run some cuda functions before calling NumCudaDevices() that might have already set an error? Error 802: system not yet init
      ialized Traceback (most recent call last):
        File "<frozen runpy>", line 198, in _run_module_as_main
        File "<frozen runpy>", line 88, in _run_code
        File "/opt/app-root/lib64/python3.11/site-packages/vllm/entrypoints/openai/api_server.py", line 571, in <module>
          uvloop.run(run_server(args))
        File "/opt/app-root/lib64/python3.11/site-packages/uvloop/__init__.py", line 105, in run
          return runner.run(wrapper())
                 ^^^^^^^^^^^^^^^^^^^^^
        File "/usr/lib64/python3.11/asyncio/runners.py", line 118, in run
          return self._loop.run_until_complete(task)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
        File "uvloop/loop.pyx", line 1518, in uvloop.loop.Loop.run_until_complete
        File "/opt/app-root/lib64/python3.11/site-packages/uvloop/__init__.py", line 61, in wrapper
          return await main
                 ^^^^^^^^^^
        File "/opt/app-root/lib64/python3.11/site-packages/vllm/entrypoints/openai/api_server.py", line 538, in run_server
          async with build_async_engine_client(args) as engine_client:
        File "/usr/lib64/python3.11/contextlib.py", line 210, in __aenter__
          return await anext(self.gen)
                 ^^^^^^^^^^^^^^^^^^^^^
        File "/opt/app-root/lib64/python3.11/site-packages/vllm/entrypoints/openai/api_server.py", line 105, in build_async_engine_client
          async with build_async_engine_client_from_engine_args(
        File "/usr/lib64/python3.11/contextlib.py", line 210, in __aenter__
          return await anext(self.gen)
                 ^^^^^^^^^^^^^^^^^^^^^
        File "/opt/app-root/lib64/python3.11/site-packages/vllm/entrypoints/openai/api_server.py", line 192, in build_async_engine_client_from_engine_args
          raise RuntimeError(
      RuntimeError: Engine process failed to start
       

       
      Expected behavior

      • The model should be served with no issues, just as before the reboot
      • Hardware Specs: AWS p5.48xlarge
      • Python Version: Python 3.9.18
      • InstructLab Version: 0.21.0
      • OS Version:
      • NAME="Red Hat Enterprise Linux"
        VERSION="9.20241104.0.4 (Plow)"
        ID="rhel"
        ID_LIKE="fedora"
        VERSION_ID="9.4"
        PLATFORM_ID="platform:el9"
        PRETTY_NAME="Red Hat Enterprise Linux 9.20241104.0.4 (Plow)"
        ANSI_COLOR="0;31"
        LOGO="fedora-logo-icon"
        CPE_NAME="cpe:/o:redhat:enterprise_linux:9::baseos"
        HOME_URL="https://www.redhat.com/"
        DOCUMENTATION_URL="https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/9"
        BUG_REPORT_URL="https://issues.redhat.com/"
        REDHAT_BUGZILLA_PRODUCT="Red Hat Enterprise Linux 9"
        REDHAT_BUGZILLA_PRODUCT_VERSION=9.4
        REDHAT_SUPPORT_PRODUCT="Red Hat Enterprise Linux"
        REDHAT_SUPPORT_PRODUCT_VERSION="9.4"
        OSTREE_VERSION='9.20241104.0'
        VARIANT="RHEL AI"
        VARIANT_ID=rhel_ai
        RHEL_AI_VERSION_ID='1.3.0'
