Red Hat Enterprise Linux AI
RHELAI-2412

Fabric manager does not always start, resulting in CUDA failures

    • Release Notes: Known Issue (Approved)

      Workaround

      sudo systemctl stop nvidia-persistenced.service
      sudo systemctl start nvidia-fabricmanager.service
      sudo systemctl start nvidia-persistenced.service
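      A quick way to confirm the workaround took effect, sketched here with standard systemd tooling (service names as above):

      # Verify both services are running after applying the workaround
      systemctl is-active nvidia-fabricmanager.service      # expect "active"
      systemctl is-active nvidia-persistenced.service       # expect "active"
      journalctl -u nvidia-fabricmanager.service --no-pager | tail -n 20   # look for a successful start, not the unmet-condition skip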

       

      Analysis

      As can be seen below, there are two problems:

      1. The NVIDIA fabric manager starts before the nvswitch device has finished initializing, so it fails the ConditionDirectoryNotEmpty check on /proc/driver/nvidia-nvswitch/devices.
      2. The NVIDIA persistence daemon starts BEFORE the fabric manager; it should always start after it.

      The likely solution for 1 is a Requires= plus After= device dependency on the nvidia-nvswitch device unit (nvidia nvswitch%i), or a later start in the boot sequence; for 2, nvidia-persistenced.service should gain an After= on nvidia-fabricmanager.service.
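      A minimal sketch of what the fix for 2 could look like as a systemd drop-in, assuming the stock unit names nvidia-persistenced.service and nvidia-fabricmanager.service; the shipped fix may differ, and the device-unit dependency for 1 is not sketched because the exact device unit name depends on how udev tags the nvswitch devices:

      # Sketch only: order the persistence daemon after the fabric manager via a drop-in
      sudo mkdir -p /etc/systemd/system/nvidia-persistenced.service.d
      printf '[Unit]\nAfter=nvidia-fabricmanager.service\nWants=nvidia-fabricmanager.service\n' | sudo tee /etc/systemd/system/nvidia-persistenced.service.d/10-after-fabricmanager.conf
      sudo systemctl daemon-reload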

       

       
      Nov 27 20:37:50 ip-10-0-30-37.us-east-2.compute.internal kernel: nvidia-nvswitch 0000:cf:00.0: enabling device (0000 -> 0002)
      Nov 27 20:37:50 ip-10-0-30-37.us-east-2.compute.internal kernel: nvidia-nvswitch 0000:cf:00.0: PCI INT A: no GSI
      Nov 27 20:37:51 ip-10-0-30-37.us-east-2.compute.internal systemd[1]: Starting NVIDIA Persistence Daemon...
      Nov 27 20:37:51 ip-10-0-30-37.us-east-2.compute.internal systemd[1]: Starting Generate /etc/cdi/nvidia.yaml...
      Nov 27 20:37:51 ip-10-0-30-37.us-east-2.compute.internal nvidia-persistenced[3475]: Verbose syslog connection opened
      Nov 27 20:37:51 ip-10-0-30-37.us-east-2.compute.internal nvidia-persistenced[3475]: Started (3475)
      Nov 27 20:37:51 ip-10-0-30-37.us-east-2.compute.internal nvidia-ctk[3474]: time="2024-11-27T20:37:51Z" level=info msg="Using /usr/lib64/libnvidia-ml.so.550.127.05"
      Nov 27 20:37:51 ip-10-0-30-37.us-east-2.compute.internal nvidia-ctk[3474]: time="2024-11-27T20:37:51Z" level=warning msg="Ignoring error in locating libnvidia-sandboxutils.so.1: pattern libnvidia-sandboxutils.so.1 not found\nlibnvidia-sandboxutils.so.1: not found"
      Nov 27 20:37:51 ip-10-0-30-37.us-east-2.compute.internal kernel: nvidia-nvswitch0: using MSI
      Nov 27 20:37:54 ip-10-0-30-37.us-east-2.compute.internal systemd[1]: NVIDIA fabric manager service was skipped because of an unmet condition check (ConditionDirectoryNotEmpty=/proc/driver/nvidia-nvswitch/devices).
      Nov 27 20:37:54 ip-10-0-30-37.us-east-2.compute.internal .tmpfS1Bo0[3703]: Fetching ostree-unverified-registry:registry.redhat.io/rhelai1/bootc-nvidia-rhel9:1.3
      Nov 27 20:37:57 ip-10-0-30-37.us-east-2.compute.internal kernel: nvidia-nvswitch: Probing device 0000:d0:00.0, Vendor Id = 0x10de, Device Id = 0x22a3, Class = 0x68000
      Nov 27 20:37:57 ip-10-0-30-37.us-east-2.compute.internal kernel: nvidia-nvswitch 0000:d0:00.0: enabling device (0000 -> 0002)
      Nov 27 20:37:57 ip-10-0-30-37.us-east-2.compute.internal kernel: nvidia-nvswitch 0000:d0:00.0: PCI INT A: no GSI
      Nov 27 20:37:58 ip-10-0-30-37.us-east-2.compute.internal kernel: nvidia-nvswitch1: using MSI

       

      To Reproduce

      Steps to reproduce the behavior:

      1. Train a model
      2. Serve the model (succeeds)
      3. Reboot the node
      4. Attempt to serve the model again
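      A hedged sketch of the command sequence behind these steps; the checkpoint path and any training flags are placeholders rather than the exact ones used:

      # Sketch of the reproduction flow (placeholder paths/flags)
      ilab model train                                        # produces checkpoints under .../phased/phase2/checkpoints/hf_format/
      ilab model serve --model-path <hf_format-checkpoint>    # succeeds before reboot
      sudo systemctl reboot
      ilab model serve --model-path <hf_format-checkpoint>    # fails with CUDA "Error 802: system not yet initialized"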

      Result:

      ilab model serve --model-path /var/home/cloud-user/.local/share/instructlab/phased/phase2/checkpoints/hf_format/samples_29117
      INFO 2024-11-26 22:18:04,244 instructlab.model.serve_backend:56: Using model '/var/home/cloud-user/.local/share/instructlab/phased/phase2/checkpoints/hf_format/samples_29117' with -1 gpu-lay
      ers and 4096 max context size.                                                                                                                                                               
      INFO 2024-11-26 22:18:04,244 instructlab.model.serve_backend:88: '--gpus' flag used alongside '--tensor-parallel-size' in the vllm_args section of the config file. Using value of the --gpus
      flag.                                                                
      INFO 2024-11-26 22:18:04,246 instructlab.model.backends.vllm:313: vLLM starting up on pid 76 at http://127.0.0.1:8000/v1
      INFO 11-26 22:18:10 api_server.py:526] vLLM API server version 0.6.2
      INFO 11-26 22:18:10 api_server.py:527] args: Namespace(host='127.0.0.1', port=8000, uvicorn_log_level='info', allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_h
      eaders=['*'], api_key=None, lora_modules=None, prompt_adapters=None, chat_template='/tmp/tmpbkz643jw', response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, ssl_
      cert_reqs=0, root_path=None, middleware=[], return_tokens_as_token_ids=False, disable_frontend_multiprocessing=False, enable_auto_tool_choice=False, tool_call_parser=None, model='/var/home/c
      loud-user/.local/share/instructlab/phased/phase2/checkpoints/hf_format/samples_29117', tokenizer=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None,
      tokenizer_mode='auto', trust_remote_code=False, download_dir=None, load_format='auto', config_format='auto', dtype='auto', kv_cache_dtype='auto', quantization_param_path=None, max_model_len=
      None, guided_decoding_backend='outlines', distributed_executor_backend='mp', worker_use_ray=False, pipeline_parallel_size=1, tensor_parallel_size=8, max_parallel_loading_workers=None, ray_wo
      rkers_use_nsight=False, block_size=16, enable_prefix_caching=False, disable_sliding_window=False, use_v2_block_manager=False, num_lookahead_slots=0, seed=0, swap_space=4, cpu_offload_gb=0, g
      pu_memory_utilization=0.9, num_gpu_blocks_override=None, max_num_batched_tokens=None, max_num_seqs=256, max_logprobs=20, disable_log_stats=False, quantization=None, rope_scaling=None, rope_t
      heta=None, enforce_eager=False, max_context_len_to_capture=None, max_seq_len_to_capture=8192, disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_poo
      l_extra_config=None, limit_mm_per_prompt=None, mm_processor_kwargs=None, enable_lora=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', long_lora_scaling_fac
      tors=None, max_cpu_loras=None, fully_sharded_loras=False, enable_prompt_adapter=False, max_prompt_adapters=1, max_prompt_adapter_token=0, device='auto', num_scheduler_steps=1, multi_step_str
      eam_outputs=False, scheduler_delay_factor=0.0, enable_chunked_prefill=None, speculative_model=None, speculative_model_quantization=None, num_speculative_tokens=None, speculative_draft_tensor
      _parallel_size=None, speculative_max_model_len=None, speculative_disable_by_batch_size=None, ngram_prompt_lookup_max=None, ngram_prompt_lookup_min=None, spec_decoding_acceptance_method='reje
      ction_sampler', typical_acceptance_sampler_posterior_threshold=None, typical_acceptance_sampler_posterior_alpha=None, disable_logprobs_during_spec_decoding=None, model_loader_extra_config=No
      ne, ignore_patterns=[], preemption_mode=None, served_model_name=None, qlora_adapter_name_or_path=None, otlp_traces_endpoint=None, collect_detailed_traces=None, disable_async_output_proc=Fals
      e, override_neuron_config=None, disable_log_requests=False, max_log_len=None, disable_fastapi_docs=False)
      INFO 11-26 22:18:10 api_server.py:164] Multiprocessing frontend to use ipc:///tmp/b4da7575-80ea-445b-9f9d-adb7a092d83c for IPC Path.
      INFO 11-26 22:18:10 api_server.py:177] Started engine process with PID 80
      INFO 11-26 22:18:10 config.py:1652] Downcasting torch.float32 to torch.float16.
      INFO 11-26 22:18:14 config.py:1652] Downcasting torch.float32 to torch.float16.
      INFO 11-26 22:18:14 llm_engine.py:226] Initializing an LLM engine (v0.6.2) with config: model='/var/home/cloud-user/.local/share/instructlab/phased/phase2/checkpoints/hf_format/samples_29117
      ', speculative_config=None, tokenizer='/var/home/cloud-user/.local/share/instructlab/phased/phase2/checkpoints/hf_format/samples_29117', skip_tokenizer_init=False, tokenizer_mode=auto, revis
      ion=None, override_neuron_config=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=4096, download_dir=None, load_fo
      rmat=LoadFormat.AUTO, tensor_parallel_size=8, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=
      None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=F
      alse, collect_model_execute_time=False), seed=0, served_model_name=/var/home/cloud-user/.local/share/instructlab/phased/phase2/checkpoints/hf_format/samples_29117, use_v2_block_manager=False
      , num_scheduler_steps=1, multi_step_stream_outputs=False, enable_prefix_caching=False, use_async_output_proc=True, use_cached_outputs=True, mm_processor_kwargs=None)
      WARNING 11-26 22:18:14 multiproc_gpu_executor.py:53] Reducing Torch parallelism from 96 threads to 1 to avoid unnecessary CPU contention. Set OMP_NUM_THREADS in the external environment to t
      une this value as needed.
      INFO 11-26 22:18:14 custom_cache_manager.py:17] Setting Triton cache manager to: vllm.triton_utils.custom_cache_manager:CustomCacheManager
      (VllmWorkerProcess pid=151) INFO 11-26 22:18:15 multiproc_worker_utils.py:218] Worker ready; awaiting tasks
      (VllmWorkerProcess pid=153) INFO 11-26 22:18:15 multiproc_worker_utils.py:218] Worker ready; awaiting tasks
      (VllmWorkerProcess pid=152) INFO 11-26 22:18:15 multiproc_worker_utils.py:218] Worker ready; awaiting tasks
      (VllmWorkerProcess pid=148) INFO 11-26 22:18:16 multiproc_worker_utils.py:218] Worker ready; awaiting tasks
      (VllmWorkerProcess pid=147) INFO 11-26 22:18:16 multiproc_worker_utils.py:218] Worker ready; awaiting tasks
      (VllmWorkerProcess pid=150) INFO 11-26 22:18:16 multiproc_worker_utils.py:218] Worker ready; awaiting tasks
      (VllmWorkerProcess pid=149) INFO 11-26 22:18:16 multiproc_worker_utils.py:218] Worker ready; awaiting tasks
      (VllmWorkerProcess pid=151) INFO 11-26 22:18:45 multiproc_worker_utils.py:244] Worker exiting
      INFO 11-26 22:18:45 multiproc_worker_utils.py:124] Killing local vLLM worker processes Process SpawnProcess-1:
      Traceback (most recent call last):
        File "/usr/lib64/python3.11/multiprocessing/process.py", line 314, in _bootstrap
          self.run()
        File "/usr/lib64/python3.11/multiprocessing/process.py", line 108, in run
          self._target(*self._args, **self._kwargs)
        File "/opt/app-root/lib64/python3.11/site-packages/vllm/engine/multiprocessing/engine.py", line 388, in run_mp_engine
          engine = MQLLMEngine.from_engine_args(engine_args=engine_args,
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
        File "/opt/app-root/lib64/python3.11/site-packages/vllm/engine/multiprocessing/engine.py", line 138, in from_engine_args
          return cls(
                 ^^^^
        File "/opt/app-root/lib64/python3.11/site-packages/vllm/engine/multiprocessing/engine.py", line 78, in __init__
          self.engine = LLMEngine(*args,
                        ^^^^^^^^^^^^^^^^
        File "/opt/app-root/lib64/python3.11/site-packages/vllm/engine/llm_engine.py", line 325, in __init__
          self.model_executor = executor_class(
                                ^^^^^^^^^^^^^^^
        File "/opt/app-root/lib64/python3.11/site-packages/vllm/executor/distributed_gpu_executor.py", line 26, in __init__
          super().__init__(*args, **kwargs)
        File "/opt/app-root/lib64/python3.11/site-packages/vllm/executor/executor_base.py", line 47, in __init__
          self._init_executor()
        File "/opt/app-root/lib64/python3.11/site-packages/vllm/executor/multiproc_gpu_executor.py", line 110, in _init_executor
          self._run_workers("init_device")
        File "/opt/app-root/lib64/python3.11/site-packages/vllm/executor/multiproc_gpu_executor.py", line 185, in _run_workers
          driver_worker_output = driver_worker_method(*args, **kwargs)
                                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
        File "/opt/app-root/lib64/python3.11/site-packages/vllm/worker/worker.py", line 166, in init_device
          torch.cuda.set_device(self.device)
        File "/opt/app-root/lib64/python3.11/site-packages/torch/cuda/__init__.py", line 420, in set_device
          torch._C._cuda_setDevice(device)
        File "/opt/app-root/lib64/python3.11/site-packages/torch/cuda/__init__.py", line 314, in _lazy_init
          torch._C._cuda_init()
      RuntimeError: Unexpected error from cudaGetDeviceCount(). Did you run some cuda functions before calling NumCudaDevices() that might have already set an error? Error 802: system not yet init
      ialized Traceback (most recent call last):
        File "<frozen runpy>", line 198, in _run_module_as_main
        File "<frozen runpy>", line 88, in _run_code
        File "/opt/app-root/lib64/python3.11/site-packages/vllm/entrypoints/openai/api_server.py", line 571, in <module>
          uvloop.run(run_server(args))
        File "/opt/app-root/lib64/python3.11/site-packages/uvloop/__init__.py", line 105, in run
          return runner.run(wrapper())
                 ^^^^^^^^^^^^^^^^^^^^^
        File "/usr/lib64/python3.11/asyncio/runners.py", line 118, in run
          return self._loop.run_until_complete(task)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
        File "uvloop/loop.pyx", line 1518, in uvloop.loop.Loop.run_until_complete
        File "/opt/app-root/lib64/python3.11/site-packages/uvloop/__init__.py", line 61, in wrapper
          return await main
                 ^^^^^^^^^^
        File "/opt/app-root/lib64/python3.11/site-packages/vllm/entrypoints/openai/api_server.py", line 538, in run_server
          async with build_async_engine_client(args) as engine_client:
        File "/usr/lib64/python3.11/contextlib.py", line 210, in __aenter__
          return await anext(self.gen)
                 ^^^^^^^^^^^^^^^^^^^^^
        File "/opt/app-root/lib64/python3.11/site-packages/vllm/entrypoints/openai/api_server.py", line 105, in build_async_engine_client
          async with build_async_engine_client_from_engine_args(
        File "/usr/lib64/python3.11/contextlib.py", line 210, in __aenter__
          return await anext(self.gen)
                 ^^^^^^^^^^^^^^^^^^^^^
        File "/opt/app-root/lib64/python3.11/site-packages/vllm/entrypoints/openai/api_server.py", line 192, in build_async_engine_client_from_engine_args
          raise RuntimeError(
      RuntimeError: Engine process failed to start
       

       
      Expected behavior

      • The model should be served with no issues, just as before the reboot
      • Hardware Specs: AWS p5.48xlarge
      • Python Version: Python 3.9.18
      • InstructLab Version: 0.21.0
      • OS Version:
      • NAME="Red Hat Enterprise Linux"
        VERSION="9.20241104.0.4 (Plow)"
        ID="rhel"
        ID_LIKE="fedora"
        VERSION_ID="9.4"
        PLATFORM_ID="platform:el9"
        PRETTY_NAME="Red Hat Enterprise Linux 9.20241104.0.4 (Plow)"
        ANSI_COLOR="0;31"
        LOGO="fedora-logo-icon"
        CPE_NAME="cpe:/o:redhat:enterprise_linux:9::baseos"
        HOME_URL="https://www.redhat.com/"
        DOCUMENTATION_URL="https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/9"
        BUG_REPORT_URL="https://issues.redhat.com/"
        REDHAT_BUGZILLA_PRODUCT="Red Hat Enterprise Linux 9"
        REDHAT_BUGZILLA_PRODUCT_VERSION=9.4
        REDHAT_SUPPORT_PRODUCT="Red Hat Enterprise Linux"
        REDHAT_SUPPORT_PRODUCT_VERSION="9.4"
        OSTREE_VERSION='9.20241104.0'
        VARIANT="RHEL AI"
        VARIANT_ID=rhel_ai
        RHEL_AI_VERSION_ID='1.3.0'
