AIPCC-1498

RHEL AI 1.5 - vLLM fails to start during training when using a separate data disk

    • Known Issue
    • AIPCC Application Platform 7, AIPCC Application Platform 8, AP Sprint 10, AP Sprint 11, AP Sprint 12, AP Sprint 13, AP Sprint 14, AP Sprint 15, AP Sprint 16, AP Sprint 17

      To Reproduce

      Steps to reproduce the behavior (a shell sketch follows the list):

      1. Deploy RHEL AI v1.5-7 Prod onto IBM Cloud (requires a data disk) or any other cloud instance with a data disk
      2. Prepare the data disk (format it as XFS and set $ILAB_HOME in ~/.bash_profile)
      3. Prepare InstructLab (run ilab config init and download the models)
      4. Run ilab data generate
      5. Run ilab model train (short)
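
      For reference, a minimal shell sketch of steps 2-5, assuming the data disk appears as /dev/vdb and is mounted at /mnt/data (the device name, mount point, and the "(short)" training flags are illustrative, not taken from this report):

        # Format the data disk as XFS and mount it (device and mount point are assumptions)
        sudo mkfs.xfs /dev/vdb
        sudo mkdir -p /mnt/data
        sudo mount /dev/vdb /mnt/data
        sudo chown -R "$USER" /mnt/data

        # Point InstructLab at the data disk via ILAB_HOME
        echo 'export ILAB_HOME=/mnt/data' >> ~/.bash_profile
        source ~/.bash_profile

        # Initialize InstructLab and download the models
        ilab config init
        ilab model download

        # Generate data, then run the short training run described above (exact flags omitted)
        ilab data generate
        ilab model train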

      Expected behavior

      • Training succeeds, and the vLLM server starts up with ample attempts remaining.

      Actual behavior:

      1. `ilab data generate` succeeds, but its vLLM server only starts on attempt 117 out of 120.
      2. `ilab model train` (short) fails 58 minutes into the run, when it attempts to start the vLLM server and exhausts all 120 attempts:
      INFO 2025-05-19 18:45:48,725 instructlab.model.backends.vllm:138: Waiting for the vLLM server to start at http://127.0.0.1:54461/v1, this might take a moment... Attempt: 120/120
      INFO 2025-05-19 18:45:50,065 instructlab.model.backends.vllm:148: Gave up waiting for vLLM server to start at http://127.0.0.1:54461/v1 after 120 attempts
      Traceback (most recent call last):
        File "/usr/lib64/python3.11/asyncio/runners.py", line 118, in run
          return self._loop.run_until_complete(task)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
        File "uvloop/loop.pyx", line 1518, in uvloop.loop.Loop.run_until_complete
        File "/opt/app-root/lib64/python3.11/site-packages/uvloop/__init__.py", line 61, in wrapper
          return await main
                 ^^^^^^^^^^
        File "/opt/app-root/lib64/python3.11/site-packages/vllm/entrypoints/openai/api_server.py", line 1069, in run_server
          async with build_async_engine_client(args) as engine_client:
        File "/usr/lib64/python3.11/contextlib.py", line 210, in __aenter__
          return await anext(self.gen)
                 ^^^^^^^^^^^^^^^^^^^^^
        File "/opt/app-root/lib64/python3.11/site-packages/vllm/entrypoints/openai/api_server.py", line 146, in build_async_engine_client
          async with build_async_engine_client_from_engine_args(
        File "/usr/lib64/python3.11/contextlib.py", line 210, in __aenter__
          return await anext(self.gen)
                 ^^^^^^^^^^^^^^^^^^^^^
        File "/opt/app-root/lib64/python3.11/site-packages/vllm/entrypoints/openai/api_server.py", line 264, in build_async_engine_client_from_engine_args
          await mq_engine_client.setup()
        File "/opt/app-root/lib64/python3.11/site-packages/vllm/engine/multiprocessing/client.py", line 284, in setup
          response = await self._wait_for_server_rpc(socket)
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
        File "/opt/app-root/lib64/python3.11/site-packages/vllm/engine/multiprocessing/client.py", line 392, in _wait_for_server_rpc
          return await self._send_get_data_rpc_request(
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
        File "/opt/app-root/lib64/python3.11/site-packages/vllm/engine/multiprocessing/client.py", line 320, in _send_get_data_rpc_request
          if await socket.poll(timeout=VLLM_RPC_TIMEOUT) == 0:
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
      asyncio.exceptions.CancelledError

      During handling of the above exception, another exception occurred:

      Traceback (most recent call last):
        File "<frozen runpy>", line 198, in _run_module_as_main
        File "<frozen runpy>", line 88, in _run_code
        File "/opt/app-root/lib64/python3.11/site-packages/vllm/entrypoints/openai/api_server.py", line 1121, in <module>
          uvloop.run(run_server(args))
        File "/opt/app-root/lib64/python3.11/site-packages/uvloop/__init__.py", line 105, in run
          return runner.run(wrapper())
                 ^^^^^^^^^^^^^^^^^^^^^
        File "/usr/lib64/python3.11/asyncio/runners.py", line 123, in run
          raise KeyboardInterrupt()
      KeyboardInterrupt
      INFO 2025-05-19 18:45:52,883 instructlab.model.backends.vllm:512: Waiting for GPU VRAM reclamation...
      ERROR 2025-05-19 18:45:58,885 instructlab.model.evaluate:832: Failed to start server: vLLM failed to start up in 397.5 seconds
      Accelerated Training failed with Failed to start server: vLLM failed to start up in 397.5 seconds
      real    58m12.297s
      user    0m1.926s
      sys    0m1.239s

      Device Info (please complete the following information):

      • Hardware Specs: gx3d-208x1792x8mi300x (8*MI300X)
      • OS Version: RHEL AI 1.5-7 Prod
      • InstructLab Version: 0.26.1
      • Output of these two commands:
        • sudo bootc status --format json | jq .status.booted.image.image.image : "registry.redhat.io/rhelai1/bootc-amd-rhel9:1.5"
        • ilab system info


      Platform:
        sys.version: 3.11.7 (main, Jan  8 2025, 00:00:00) [GCC 11.4.1 20231218 (Red Hat 11.4.1-3)]
        sys.platform: linux
        os.name: posix
        platform.release: 5.14.0-427.65.1.el9_4.x86_64
        platform.machine: x86_64
        platform.node: mdepaulo-v157-amd-prod-2
        platform.python_version: 3.11.7
        os-release.ID: rhel
        os-release.VERSION_ID: 9.4
        os-release.PRETTY_NAME: Red Hat Enterprise Linux 9.4 (Plow)
        memory.total: 1763.82 GB
        memory.available: 1729.31 GB
        memory.used: 28.04 GB
      InstructLab:
        instructlab.version: 0.26.1
        instructlab-dolomite.version: 0.2.0
        instructlab-eval.version: 0.5.1
        instructlab-quantize.version: 0.1.0
        instructlab-schema.version: 0.4.2
        instructlab-sdg.version: 0.8.2
        instructlab-training.version: 0.10.2
      Torch:
        torch.version: 2.6.0
        torch.backends.cpu.capability: AVX512
        torch.version.cuda: None
        torch.version.hip: 6.3.42134-a9a80e791
        torch.cuda.available: True
        torch.backends.cuda.is_built: True
        torch.backends.mps.is_built: False
        torch.backends.mps.is_available: False
        torch.cuda.bf16: True
        torch.cuda.current.device: 0
        torch.cuda.0.name: AMD Radeon Graphics
        torch.cuda.0.free: 191.5 GB
        torch.cuda.0.total: 192.0 GB
        torch.cuda.0.capability: 9.4 (see https://developer.nvidia.com/cuda-gpus#compute)
        torch.cuda.1.name: AMD Radeon Graphics
        torch.cuda.1.free: 191.5 GB
        torch.cuda.1.total: 192.0 GB
        torch.cuda.1.capability: 9.4 (see https://developer.nvidia.com/cuda-gpus#compute)
        torch.cuda.2.name: AMD Radeon Graphics
        torch.cuda.2.free: 191.5 GB
        torch.cuda.2.total: 192.0 GB
        torch.cuda.2.capability: 9.4 (see https://developer.nvidia.com/cuda-gpus#compute)
        torch.cuda.3.name: AMD Radeon Graphics
        torch.cuda.3.free: 191.5 GB
        torch.cuda.3.total: 192.0 GB
        torch.cuda.3.capability: 9.4 (see https://developer.nvidia.com/cuda-gpus#compute)
        torch.cuda.4.name: AMD Radeon Graphics
        torch.cuda.4.free: 191.5 GB
        torch.cuda.4.total: 192.0 GB
        torch.cuda.4.capability: 9.4 (see https://developer.nvidia.com/cuda-gpus#compute)
        torch.cuda.5.name: AMD Radeon Graphics
        torch.cuda.5.free: 191.5 GB
        torch.cuda.5.total: 192.0 GB
        torch.cuda.5.capability: 9.4 (see https://developer.nvidia.com/cuda-gpus#compute)
        torch.cuda.6.name: AMD Radeon Graphics
        torch.cuda.6.free: 191.5 GB
        torch.cuda.6.total: 192.0 GB
        torch.cuda.6.capability: 9.4 (see https://developer.nvidia.com/cuda-gpus#compute)
        torch.cuda.7.name: AMD Radeon Graphics
        torch.cuda.7.free: 191.5 GB
        torch.cuda.7.total: 192.0 GB
        torch.cuda.7.capability: 9.4 (see https://developer.nvidia.com/cuda-gpus#compute)
      llama_cpp_python:
        llama_cpp_python.version: 0.3.6
        llama_cpp_python.supports_gpu_offload: False

      Bug impact

      • Training cannot be run.

      Known workaround

      • Raise both instances of `max_startup_attempts` in the ilab config.yaml from 120 to 1200, as sketched below.
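
      The 120 attempts correspond to the 397.5-second startup budget in the log above (about 3.3 seconds per attempt), so 1200 attempts gives vLLM roughly ten times as long to come up. A minimal sketch of the edit, assuming the config file lives at the default InstructLab location under $ILAB_HOME (the path and the sed command are illustrative; editing the file by hand works just as well):

        # Assumption: config lives at the default InstructLab location under $ILAB_HOME;
        # adjust the path if your deployment differs.
        CONFIG="${ILAB_HOME:-$HOME}/.config/instructlab/config.yaml"

        # Back up the config, then raise every max_startup_attempts from 120 to 1200.
        cp "$CONFIG" "$CONFIG.bak"
        sed -i 's/max_startup_attempts: 120/max_startup_attempts: 1200/' "$CONFIG"

        # Confirm that both occurrences were updated.
        grep -n 'max_startup_attempts' "$CONFIG"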

      Additional context

      `ilab model serve` and `ilab model chat` worked, but nearly timed out: the vLLM server started on attempt 117 out of 120.


        1. ilab-model-train.txt
          366 kB
        2. ilab-config-show
          20 kB
        3. ilab-data-generate
          65 kB
        4. ilab-train-1200attempts
          2.61 MB

              mdepaulo@redhat.com Mike DePaulo
              Antonio's Team