AIPCC-1498

RHEL AI 1.5 - vLLM fails to start during training when using a separate data disk

    • Known Issue
    • AIPCC Application Platform 7, AIPCC Application Platform 8, AP Sprint 10, AP Sprint 11, AP Sprint 12, AP Sprint 13, AP Sprint 14, AP Sprint 15, AP Sprint 16, AP Sprint 17

      To Reproduce

      Steps to reproduce the behavior (a shell sketch follows the list):

      1. Deploy RHEL AI v1.5-7 Prod onto IBM Cloud (requires a data disk) or any other cloud instance with a data disk
      2. Prepare the data disk (format it as XFS and set $ILAB_HOME in ~/.bash_profile)
      3. Prepare InstructLab (run ilab config init and download the models)
      4. Run ilab data generate
      5. Run ilab model train (short)
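
      For reference, a minimal shell sketch of steps 2-5, assuming the data disk appears as /dev/vdb and is mounted at /mnt/data (the device name, mount point, and the "(short)" training flags are illustrative, not taken from this report):

        # Format the data disk as XFS and mount it (device and mount point are assumptions)
        sudo mkfs.xfs /dev/vdb
        sudo mkdir -p /mnt/data
        sudo mount /dev/vdb /mnt/data
        sudo chown -R "$USER" /mnt/data

        # Point InstructLab at the data disk via ILAB_HOME
        echo 'export ILAB_HOME=/mnt/data' >> ~/.bash_profile
        source ~/.bash_profile

        # Initialize InstructLab and download the models
        ilab config init
        ilab model download

        # Generate data, then run the short training run described above (exact flags omitted)
        ilab data generate
        ilab model train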

      Expected behavior

      • Training succeeds, and the vLLM server starts up with ample attempts remaining.

      Actual behavior:

      1. `ilab data generate` succeeds, but its vLLM server only starts on attempt 117 out of 120.
      2. `ilab model train` (short) fails 58 minutes into the run, when it attempts to start the vLLM server and exhausts all 120 attempts:
      INFO 2025-05-19 18:45:48,725 instructlab.model.backends.vllm:138: Waiting for the vLLM server to start at http://127.0.0.1:54461/v1, this might take a moment... Attempt: 120/120
      INFO 2025-05-19 18:45:50,065 instructlab.model.backends.vllm:148: Gave up waiting for vLLM server to start at http://127.0.0.1:54461/v1 after 120 attempts
      Traceback (most recent call last):
        File "/usr/lib64/python3.11/asyncio/runners.py", line 118, in run
          return self._loop.run_until_complete(task)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
        File "uvloop/loop.pyx", line 1518, in uvloop.loop.Loop.run_until_complete
        File "/opt/app-root/lib64/python3.11/site-packages/uvloop/__init__.py", line 61, in wrapper
          return await main
                 ^^^^^^^^^^
        File "/opt/app-root/lib64/python3.11/site-packages/vllm/entrypoints/openai/api_server.py", line 1069, in run_server
          async with build_async_engine_client(args) as engine_client:
        File "/usr/lib64/python3.11/contextlib.py", line 210, in __aenter__
          return await anext(self.gen)
                 ^^^^^^^^^^^^^^^^^^^^^
        File "/opt/app-root/lib64/python3.11/site-packages/vllm/entrypoints/openai/api_server.py", line 146, in build_async_engine_client
          async with build_async_engine_client_from_engine_args(
        File "/usr/lib64/python3.11/contextlib.py", line 210, in __aenter__
          return await anext(self.gen)
                 ^^^^^^^^^^^^^^^^^^^^^
        File "/opt/app-root/lib64/python3.11/site-packages/vllm/entrypoints/openai/api_server.py", line 264, in build_async_engine_client_from_engine_args
          await mq_engine_client.setup()
        File "/opt/app-root/lib64/python3.11/site-packages/vllm/engine/multiprocessing/client.py", line 284, in setup
          response = await self._wait_for_server_rpc(socket)
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
        File "/opt/app-root/lib64/python3.11/site-packages/vllm/engine/multiprocessing/client.py", line 392, in _wait_for_server_rpc
          return await self._send_get_data_rpc_request(
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
        File "/opt/app-root/lib64/python3.11/site-packages/vllm/engine/multiprocessing/client.py", line 320, in _send_get_data_rpc_request
          if await socket.poll(timeout=VLLM_RPC_TIMEOUT) == 0:
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
      asyncio.exceptions.CancelledError

      During handling of the above exception, another exception occurred:

      Traceback (most recent call last):
        File "<frozen runpy>", line 198, in _run_module_as_main
        File "<frozen runpy>", line 88, in _run_code
        File "/opt/app-root/lib64/python3.11/site-packages/vllm/entrypoints/openai/api_server.py", line 1121, in <module>
          uvloop.run(run_server(args))
        File "/opt/app-root/lib64/python3.11/site-packages/uvloop/__init__.py", line 105, in run
          return runner.run(wrapper())
                 ^^^^^^^^^^^^^^^^^^^^^
        File "/usr/lib64/python3.11/asyncio/runners.py", line 123, in run
          raise KeyboardInterrupt()
      KeyboardInterrupt
      INFO 2025-05-19 18:45:52,883 instructlab.model.backends.vllm:512: Waiting for GPU VRAM reclamation...
      ERROR 2025-05-19 18:45:58,885 instructlab.model.evaluate:832: Failed to start server: vLLM failed to start up in 397.5 seconds
      Accelerated Training failed with Failed to start server: vLLM failed to start up in 397.5 seconds
      real    58m12.297s
      user    0m1.926s
      sys    0m1.239s

      Device Info (please complete the following information):

      • Hardware Specs: gx3d-208x1792x8mi300x (8*MI300X)
      • OS Version: RHEL AI 1.5-7 Prod
      • InstructLab Version: 0.26.1
      • Output of these two commands:
        • sudo bootc status --format json | jq .status.booted.image.image.image : "registry.redhat.io/rhelai1/bootc-amd-rhel9:1.5"
        • ilab system info


      Platform:
        sys.version: 3.11.7 (main, Jan  8 2025, 00:00:00) [GCC 11.4.1 20231218 (Red Hat 11.4.1-3)]
        sys.platform: linux
        os.name: posix
        platform.release: 5.14.0-427.65.1.el9_4.x86_64
        platform.machine: x86_64
        platform.node: mdepaulo-v157-amd-prod-2
        platform.python_version: 3.11.7
        os-release.ID: rhel
        os-release.VERSION_ID: 9.4
        os-release.PRETTY_NAME: Red Hat Enterprise Linux 9.4 (Plow)
        memory.total: 1763.82 GB
        memory.available: 1729.31 GB
        memory.used: 28.04 GB
      InstructLab:
        instructlab.version: 0.26.1
        instructlab-dolomite.version: 0.2.0
        instructlab-eval.version: 0.5.1
        instructlab-quantize.version: 0.1.0
        instructlab-schema.version: 0.4.2
        instructlab-sdg.version: 0.8.2
        instructlab-training.version: 0.10.2
      Torch:
        torch.version: 2.6.0
        torch.backends.cpu.capability: AVX512
        torch.version.cuda: None
        torch.version.hip: 6.3.42134-a9a80e791
        torch.cuda.available: True
        torch.backends.cuda.is_built: True
        torch.backends.mps.is_built: False
        torch.backends.mps.is_available: False
        torch.cuda.bf16: True
        torch.cuda.current.device: 0
        torch.cuda.0.name: AMD Radeon Graphics
        torch.cuda.0.free: 191.5 GB
        torch.cuda.0.total: 192.0 GB
        torch.cuda.0.capability: 9.4 (see https://developer.nvidia.com/cuda-gpus#compute)
        torch.cuda.1.name: AMD Radeon Graphics
        torch.cuda.1.free: 191.5 GB
        torch.cuda.1.total: 192.0 GB
        torch.cuda.1.capability: 9.4 (see https://developer.nvidia.com/cuda-gpus#compute)
        torch.cuda.2.name: AMD Radeon Graphics
        torch.cuda.2.free: 191.5 GB
        torch.cuda.2.total: 192.0 GB
        torch.cuda.2.capability: 9.4 (see https://developer.nvidia.com/cuda-gpus#compute)
        torch.cuda.3.name: AMD Radeon Graphics
        torch.cuda.3.free: 191.5 GB
        torch.cuda.3.total: 192.0 GB
        torch.cuda.3.capability: 9.4 (see https://developer.nvidia.com/cuda-gpus#compute)
        torch.cuda.4.name: AMD Radeon Graphics
        torch.cuda.4.free: 191.5 GB
        torch.cuda.4.total: 192.0 GB
        torch.cuda.4.capability: 9.4 (see https://developer.nvidia.com/cuda-gpus#compute)
        torch.cuda.5.name: AMD Radeon Graphics
        torch.cuda.5.free: 191.5 GB
        torch.cuda.5.total: 192.0 GB
        torch.cuda.5.capability: 9.4 (see https://developer.nvidia.com/cuda-gpus#compute)
        torch.cuda.6.name: AMD Radeon Graphics
        torch.cuda.6.free: 191.5 GB
        torch.cuda.6.total: 192.0 GB
        torch.cuda.6.capability: 9.4 (see https://developer.nvidia.com/cuda-gpus#compute)
        torch.cuda.7.name: AMD Radeon Graphics
        torch.cuda.7.free: 191.5 GB
        torch.cuda.7.total: 192.0 GB
        torch.cuda.7.capability: 9.4 (see https://developer.nvidia.com/cuda-gpus#compute)
      llama_cpp_python:
        llama_cpp_python.version: 0.3.6
        llama_cpp_python.supports_gpu_offload: False

      Bug impact

      • Training cannot be run.

      Known workaround

      • Raise both instances of `max_startup_attempts` in the ilab config.yaml from 120 to 1200, as sketched below.
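
      The 120 attempts correspond to the 397.5-second startup budget in the log above (about 3.3 seconds per attempt), so 1200 attempts gives vLLM roughly ten times as long to come up. A minimal sketch of the edit, assuming the config file lives at the default InstructLab location under $ILAB_HOME (the path and the sed command are illustrative; editing the file by hand works just as well):

        # Assumption: config lives at the default InstructLab location under $ILAB_HOME;
        # adjust the path if your deployment differs.
        CONFIG="${ILAB_HOME:-$HOME}/.config/instructlab/config.yaml"

        # Back up the config, then raise every max_startup_attempts from 120 to 1200.
        cp "$CONFIG" "$CONFIG.bak"
        sed -i 's/max_startup_attempts: 120/max_startup_attempts: 1200/' "$CONFIG"

        # Confirm that both occurrences were updated.
        grep -n 'max_startup_attempts' "$CONFIG"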

      Additional context

      `ilab model serve` and `ilab model chat` worked, but nearly timed out: the vLLM server started on attempt 117 out of 120.


        1. ilab-model-train.txt
          366 kB
        2. ilab-config-show
          20 kB
        3. ilab-data-generate
          65 kB
        4. ilab-train-1200attempts
          2.61 MB

              mdepaulo@redhat.com Mike DePaulo
              Antonio's Team