- Bug
- Resolution: Unresolved
- Critical
- rhelai-1.5
- False
- False
- Known Issue
- AIPCC Application Platform 7, AIPCC Application Platform 8, AP Sprint 10, AP Sprint 11, AP Sprint 12, AP Sprint 13, AP Sprint 14, AP Sprint 15, AP Sprint 16, AP Sprint 17
To Reproduce
Steps to reproduce the behavior (a command sketch follows the list):
- Deploy RHEL AI v1.5-7 Prod onto IBM Cloud (which requires a data disk) or any other cloud instance with a data disk
- Prepare the data disk (format it as XFS, set $ILAB_HOME in ~/.bash_profile)
- Prepare InstructLab (`ilab config init`, download the models)
- Run `ilab data generate`
- Run `ilab model train` (short)
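A minimal sketch of these steps, assuming the data disk shows up as /dev/vdb and is mounted at /mnt/ilab-data (the device name, mount point, and download step are illustrative assumptions; follow the RHEL AI documentation for the authoritative procedure):

```bash
# Prepare the data disk (assumed device /dev/vdb and mount point /mnt/ilab-data).
sudo mkfs.xfs /dev/vdb
sudo mkdir -p /mnt/ilab-data
sudo mount /dev/vdb /mnt/ilab-data

# Point InstructLab at the data disk.
echo 'export ILAB_HOME=/mnt/ilab-data' >> ~/.bash_profile
source ~/.bash_profile

# Initialize InstructLab and download the models for the detected profile.
ilab config init
ilab model download

# Run SDG, then the training run; the "short" variant in this report
# presumably adds options not shown here.
ilab data generate
ilab model train
```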
Expected behavior:
- Training completes successfully, and the vLLM server always starts up with ample attempts remaining.
Actual behavior:
- `ilab data generate` succeeds, but its vLLM server only starts on attempt 117 out of 120.
- `ilab model train` (short) fails 58 minutes into the run: when it tries to start the vLLM server, all 120 attempts are exhausted.
INFO 2025-05-19 18:45:48,725 instructlab.model.backends.vllm:138: Waiting for the vLLM server to start at http://127.0.0.1:54461/v1, this might take a moment... Attempt: 120/120
INFO 2025-05-19 18:45:50,065 instructlab.model.backends.vllm:148: Gave up waiting for vLLM server to start at http://127.0.0.1:54461/v1 after 120 attempts
Traceback (most recent call last):
  File "/usr/lib64/python3.11/asyncio/runners.py", line 118, in run
    return self._loop.run_until_complete(task)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "uvloop/loop.pyx", line 1518, in uvloop.loop.Loop.run_until_complete
  File "/opt/app-root/lib64/python3.11/site-packages/uvloop/__init__.py", line 61, in wrapper
    return await main
           ^^^^^^^^^^
  File "/opt/app-root/lib64/python3.11/site-packages/vllm/entrypoints/openai/api_server.py", line 1069, in run_server
    async with build_async_engine_client(args) as engine_client:
  File "/usr/lib64/python3.11/contextlib.py", line 210, in __aenter__
    return await anext(self.gen)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/opt/app-root/lib64/python3.11/site-packages/vllm/entrypoints/openai/api_server.py", line 146, in build_async_engine_client
    async with build_async_engine_client_from_engine_args(
  File "/usr/lib64/python3.11/contextlib.py", line 210, in __aenter__
    return await anext(self.gen)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/opt/app-root/lib64/python3.11/site-packages/vllm/entrypoints/openai/api_server.py", line 264, in build_async_engine_client_from_engine_args
    await mq_engine_client.setup()
  File "/opt/app-root/lib64/python3.11/site-packages/vllm/engine/multiprocessing/client.py", line 284, in setup
    response = await self._wait_for_server_rpc(socket)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/app-root/lib64/python3.11/site-packages/vllm/engine/multiprocessing/client.py", line 392, in _wait_for_server_rpc
    return await self._send_get_data_rpc_request(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/app-root/lib64/python3.11/site-packages/vllm/engine/multiprocessing/client.py", line 320, in _send_get_data_rpc_request
    if await socket.poll(timeout=VLLM_RPC_TIMEOUT) == 0:
       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
asyncio.exceptions.CancelledError

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "/opt/app-root/lib64/python3.11/site-packages/vllm/entrypoints/openai/api_server.py", line 1121, in <module>
    uvloop.run(run_server(args))
  File "/opt/app-root/lib64/python3.11/site-packages/uvloop/__init__.py", line 105, in run
    return runner.run(wrapper())
           ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib64/python3.11/asyncio/runners.py", line 123, in run
    raise KeyboardInterrupt()
KeyboardInterrupt
INFO 2025-05-19 18:45:52,883 instructlab.model.backends.vllm:512: Waiting for GPU VRAM reclamation...
ERROR 2025-05-19 18:45:58,885 instructlab.model.evaluate:832: Failed to start server: vLLM failed to start up in 397.5 seconds
Accelerated Training failed with Failed to start server: vLLM failed to start up in 397.5 seconds

real    58m12.297s
user    0m1.926s
sys     0m1.239s
Device Info (please complete the following information):
- Hardware Specs: gx3d-208x1792x8mi300x (8*MI300X)
- OS Version: RHEL AI 1.5-7 Prod
- InstructLab Version: 0.26.1
- Output of these two commands:
- `sudo bootc status --format json | jq .status.booted.image.image.image`: "registry.redhat.io/rhelai1/bootc-amd-rhel9:1.5"
- `ilab system info`:
Platform:
  sys.version: 3.11.7 (main, Jan 8 2025, 00:00:00) [GCC 11.4.1 20231218 (Red Hat 11.4.1-3)]
  sys.platform: linux
  os.name: posix
  platform.release: 5.14.0-427.65.1.el9_4.x86_64
  platform.machine: x86_64
  platform.node: mdepaulo-v157-amd-prod-2
  platform.python_version: 3.11.7
  os-release.ID: rhel
  os-release.VERSION_ID: 9.4
  os-release.PRETTY_NAME: Red Hat Enterprise Linux 9.4 (Plow)
  memory.total: 1763.82 GB
  memory.available: 1729.31 GB
  memory.used: 28.04 GB

InstructLab:
  instructlab.version: 0.26.1
  instructlab-dolomite.version: 0.2.0
  instructlab-eval.version: 0.5.1
  instructlab-quantize.version: 0.1.0
  instructlab-schema.version: 0.4.2
  instructlab-sdg.version: 0.8.2
  instructlab-training.version: 0.10.2

Torch:
  torch.version: 2.6.0
  torch.backends.cpu.capability: AVX512
  torch.version.cuda: None
  torch.version.hip: 6.3.42134-a9a80e791
  torch.cuda.available: True
  torch.backends.cuda.is_built: True
  torch.backends.mps.is_built: False
  torch.backends.mps.is_available: False
  torch.cuda.bf16: True
  torch.cuda.current.device: 0
  torch.cuda.0.name: AMD Radeon Graphics  torch.cuda.0.free: 191.5 GB  torch.cuda.0.total: 192.0 GB  torch.cuda.0.capability: 9.4 (see https://developer.nvidia.com/cuda-gpus#compute)
  torch.cuda.1.name: AMD Radeon Graphics  torch.cuda.1.free: 191.5 GB  torch.cuda.1.total: 192.0 GB  torch.cuda.1.capability: 9.4 (see https://developer.nvidia.com/cuda-gpus#compute)
  torch.cuda.2.name: AMD Radeon Graphics  torch.cuda.2.free: 191.5 GB  torch.cuda.2.total: 192.0 GB  torch.cuda.2.capability: 9.4 (see https://developer.nvidia.com/cuda-gpus#compute)
  torch.cuda.3.name: AMD Radeon Graphics  torch.cuda.3.free: 191.5 GB  torch.cuda.3.total: 192.0 GB  torch.cuda.3.capability: 9.4 (see https://developer.nvidia.com/cuda-gpus#compute)
  torch.cuda.4.name: AMD Radeon Graphics  torch.cuda.4.free: 191.5 GB  torch.cuda.4.total: 192.0 GB  torch.cuda.4.capability: 9.4 (see https://developer.nvidia.com/cuda-gpus#compute)
  torch.cuda.5.name: AMD Radeon Graphics  torch.cuda.5.free: 191.5 GB  torch.cuda.5.total: 192.0 GB  torch.cuda.5.capability: 9.4 (see https://developer.nvidia.com/cuda-gpus#compute)
  torch.cuda.6.name: AMD Radeon Graphics  torch.cuda.6.free: 191.5 GB  torch.cuda.6.total: 192.0 GB  torch.cuda.6.capability: 9.4 (see https://developer.nvidia.com/cuda-gpus#compute)
  torch.cuda.7.name: AMD Radeon Graphics  torch.cuda.7.free: 191.5 GB  torch.cuda.7.total: 192.0 GB  torch.cuda.7.capability: 9.4 (see https://developer.nvidia.com/cuda-gpus#compute)

llama_cpp_python:
  llama_cpp_python.version: 0.3.6
  llama_cpp_python.supports_gpu_offload: False
Bug impact
- Training cannot be run.
Known workaround
- Raise both instances of `max_startup_attempts` in the ilab config.yaml from 120 to 1200 (sketched below).
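A sketch of the edit, assuming the default config location of ~/.config/instructlab/config.yaml and that both occurrences are written literally as `max_startup_attempts: 120`:

```bash
# Assumes the default InstructLab config path; adjust if config.yaml lives elsewhere.
CONFIG=~/.config/instructlab/config.yaml

# Bump both vLLM startup-attempt limits from 120 to 1200.
sed -i 's/max_startup_attempts: 120/max_startup_attempts: 1200/g' "$CONFIG"

# Verify that both instances were changed.
grep -n 'max_startup_attempts' "$CONFIG"
```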
Additional context
- `ilab model serve` and `ilab model chat` worked, but came close to timing out: the vLLM server only started on attempt 117 out of 120 (a quick check is sketched below).
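To see how close a run comes to the limit, the attempt counter can be watched directly in the startup log (the `Attempt:` format is taken from the failure output above):

```bash
# Watch the vLLM startup attempt counter while serving the model.
ilab model serve 2>&1 | grep --line-buffered 'Attempt:'
```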
- clones
  - AIPCC-726 RHEL AI 1.4.3 - vllm fails to start on AMD for SDG - Ready to be tested (Closed)
- is caused by
  - AIPCC-1500 RHEL AI - podman's fuse-overlayfs has heavy CPU usage and hangs the vLLM server startup (In Progress)
- is cloned by
  - AIPCC-1500 RHEL AI - podman's fuse-overlayfs has heavy CPU usage and hangs the vLLM server startup (In Progress)