  1. AI Platform Core Components
  2. AIPCC-8758

vLLM 0.13.0 (RHAIIS 3.3) fails to detect ROCm device

    • Type: Bug
    • Resolution: Done
    • Priority: Critical
    • None
    • RHAIIS-3.3
    • Accelerator Enablement
    • None
    • False
    • False
    • AIPCC Accelerators 24

      Steps to reproduce the behavior:

      1. Log in to the registry:

         podman login quay.io

      2. Pull the ROCm base image:

         podman pull quay.io/aipcc/base-images/rocm-6.4-el9.6

      3. Create a podman container:

         podman run -dit \
          --name aipccrocm \
          --device=/dev/kfd \
          --device=/dev/dri \
          --security-opt seccomp=unconfined \
          --security-opt apparmor=unconfined \
          --group-add video \
          -v /home:/home \
          --user root \
          quay.io/aipcc/base-images/rocm-6.4-el9.6

      4. Create a requirements file (req.txt) from the RHAIIS ROCm collection: https://gitlab.com/redhat/rhel-ai/rhaiis/pipeline/-/blob/main/collections/rhaiis/rocm-ubi9/requirements.txt?ref_type=heads

      5. Pip-install the RHAIIS ROCm wheels:

         NETRC=.netrc pip install --only-binary :all: --index-url https://gitlab.com/api/v4/projects/75894209/packages/pypi/simple/ --trusted-host gitlab.com -r req.txt
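
      After the install step, it may help to record which wheel versions actually landed in the virtual environment before running any probes. The sketch below uses only the standard library; the helper name `installed_versions` and the distribution names in the usage section are assumptions for illustration and should be adjusted to match the entries in req.txt.

```python
from importlib import metadata


def installed_versions(names):
    """Return {distribution name: installed version, or None if not installed}."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    # Assumed distribution names; adjust them to match req.txt.
    for name, version in installed_versions(["vllm", "torch", "amdsmi"]).items():
        print(f"{name}: {version or 'NOT INSTALLED'}")
```

      Attaching this output to the issue makes it easier to compare a failing collection against an older one that passed.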

      Logs:

      root@enc1-gpuvm021:/home/hotaisle# podman images
      REPOSITORY                                TAG         IMAGE ID      CREATED       SIZE
      quay.io/aipcc/base-images/rocm-6.4-el9.6  latest      20343085fb95  42 hours ago  16.2 GB 
      (.venv) (app-root) /opt/app-root/wheels-test$ ROCM_HOME=/opt/rocm ROCM_PATH=/opt/rocm python -c "
      import os
      print('ROCM_HOME:', os.environ.get('ROCM_HOME'))
      print('ROCM_PATH:', os.environ.get('ROCM_PATH'))
      
      from vllm.platforms import current_platform
      print('Platform:', current_platform)
      print('Device type:', repr(current_platform.device_type))
      
      # If platform detected, try LLM
      if current_platform.device_type:
          from vllm import LLM
          llm = LLM('Qwen/Qwen3-0.6B', max_model_len=256)
          print('vLLM works!')
      else:
          print('Platform still not detected')
      "
      ROCM_HOME: /opt/rocm
      ROCM_PATH: /opt/rocm
      Platform: <vllm.platforms.interface.UnspecifiedPlatform object at 0x7f852f754f80>
      Device type: ''
      Platform still not detected
      (.venv) (app-root) /opt/app-root/wheels-test$ 
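
      The output above shows vLLM falling back to UnspecifiedPlatform with an empty device type. A possible next step is to check the prerequisites that ROCm detection plausibly depends on. The following is a minimal sketch, not vLLM's own detection logic: the helper name `rocm_diagnostics` is hypothetical, and the assumption that detection relies on the `amdsmi` Python bindings and a ROCm build of PyTorch is not confirmed by this issue.

```python
import importlib.util
import os


def rocm_diagnostics():
    """Collect basic facts that ROCm platform detection plausibly depends on."""
    info = {
        # Device nodes passed into the container by `podman run --device=...`.
        "/dev/kfd exists": os.path.exists("/dev/kfd"),
        "/dev/dri exists": os.path.exists("/dev/dri"),
        # Whether the modules can be located (this does not import them).
        "amdsmi importable": importlib.util.find_spec("amdsmi") is not None,
        "torch importable": importlib.util.find_spec("torch") is not None,
        "torch.version.hip": None,
    }
    if info["torch importable"]:
        try:
            import torch

            # None on non-ROCm builds of PyTorch, a version string on ROCm builds.
            info["torch.version.hip"] = getattr(torch.version, "hip", None)
        except Exception as exc:  # a broken wheel can fail at import time
            info["torch.version.hip"] = f"import failed: {exc!r}"
    return info


if __name__ == "__main__":
    for key, value in rocm_diagnostics().items():
        print(f"{key}: {value}")
```

      If the device nodes exist but `amdsmi` is missing or `torch.version.hip` is None, that would point at the installed wheels rather than at the container setup.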

      Probe-Test failure:

          def test_stability_loop(self, current_accelerator, current_architecture):
              """
              Test vllm stability with multiple inference cycles.
          
              This test runs multiple inference iterations to detect memory leaks
              or stability issues.
              """
              logger.info("")
              logger.info("=" * 70)
              logger.info("STABILITY LOOP TEST")
              logger.info("=" * 70)
              logger.info("Accelerator: %s", current_accelerator)
              logger.info("Architecture: %s", current_architecture)
          
              # Skip on CPU
              if current_accelerator == AcceleratorType.CPU:
                  logger.info("Skipping: vLLM stability test requires GPU or TPU")
                  pytest.skip("stability test requires GPU or TPU")
          
              if "vllm" not in registry.tests:
                  logger.warning("vLLM tests not registered - skipping")
                  pytest.skip("vllm tests not registered")
          
              tests = registry.tests["vllm"].get(Category.STABILITY, [])
              if not tests:
                  logger.warning("No stability tests available - skipping")
                  pytest.skip("No stability tests available")
          
              logger.info("Running %d stability test(s)...", len(tests))
          
              results = []
              for func_name, test_func in tests:
                  logger.info("   Testing: %s", func_name)
                  result = test_func(current_accelerator, current_architecture)
                  results.append(result)
          
                  if result.success:
                      logger.info("      ✓ PASSED (%.3fs)", result.execution_time)
                  else:
                      logger.error("      ✗ FAILED: %s", result.error_message)
          
              # At least one stability test should pass
              success_count = sum(1 for r in results if r.success)
          
              logger.info("")
              logger.info("Test Results: %d/%d passed", success_count, len(results))
              logger.info("=" * 70)
          
              error_msg = results[0].error_message if results else "no tests run"
      >       assert success_count > 0, f"All stability tests failed: {error_msg}"
      E       AssertionError: All stability tests failed: Device string must not be empty
      E       assert 0 > 0
      
      probe-tests/probe-vllm/vllm_probe_test.py:1265: AssertionError
      ------------------------------------------------------------------ Captured stdout call -------------------------------------------------------------------
      2026-01-16 12:41:11,796 | vllm_probe_test | INFO | 
      2026-01-16 12:41:11,796 - vllm_probe_test - INFO - 
      2026-01-16 12:41:11,796 | vllm_probe_test | INFO | ======================================================================
      2026-01-16 12:41:11,796 - vllm_probe_test - INFO - ======================================================================
      2026-01-16 12:41:11,796 | vllm_probe_test | INFO | STABILITY LOOP TEST
      2026-01-16 12:41:11,796 - vllm_probe_test - INFO - STABILITY LOOP TEST
      2026-01-16 12:41:11,796 | vllm_probe_test | INFO | ======================================================================
      2026-01-16 12:41:11,796 - vllm_probe_test - INFO - ======================================================================
      2026-01-16 12:41:11,796 | vllm_probe_test | INFO | Accelerator: AcceleratorType.ROCM
      2026-01-16 12:41:11,796 - vllm_probe_test - INFO - Accelerator: AcceleratorType.ROCM
      2026-01-16 12:41:11,796 | vllm_probe_test | INFO | Architecture: ArchitectureType.X86
      2026-01-16 12:41:11,796 - vllm_probe_test - INFO - Architecture: ArchitectureType.X86
      2026-01-16 12:41:11,796 | vllm_probe_test | INFO | Running 1 stability test(s)...
      2026-01-16 12:41:11,796 - vllm_probe_test - INFO - Running 1 stability test(s)...
      2026-01-16 12:41:11,796 | vllm_probe_test | INFO |    Testing: stability_loop
      2026-01-16 12:41:11,796 - vllm_probe_test - INFO -    Testing: stability_loop
      2026-01-16 12:41:11,797 - vllm.entrypoints.utils - INFO - non-default args: {'max_model_len': 256, 'gpu_memory_utilization': 0.5, 'disable_log_stats': True, 'enforce_eager': True}
      2026-01-16 12:41:12,204 | vllm_probe_test | ERROR |       ✗ FAILED: Device string must not be empty
      2026-01-16 12:41:12,204 - vllm_probe_test - ERROR -       ✗ FAILED: Device string must not be empty
      2026-01-16 12:41:12,204 | vllm_probe_test | INFO | 
      2026-01-16 12:41:12,204 - vllm_probe_test - INFO - 
      2026-01-16 12:41:12,204 | vllm_probe_test | INFO | Test Results: 0/1 passed
      2026-01-16 12:41:12,204 - vllm_probe_test - INFO - Test Results: 0/1 passed
      2026-01-16 12:41:12,204 | vllm_probe_test | INFO | ======================================================================
      2026-01-16 12:41:12,204 - vllm_probe_test - INFO - ======================================================================
      -------------------------------------------------------------------- Captured log call --------------------------------------------------------------------
      INFO     vllm_probe_test:vllm_probe_test.py:1223 
      INFO     vllm_probe_test:vllm_probe_test.py:1224 ======================================================================
      INFO     vllm_probe_test:vllm_probe_test.py:1225 STABILITY LOOP TEST
      INFO     vllm_probe_test:vllm_probe_test.py:1226 ======================================================================
      INFO     vllm_probe_test:vllm_probe_test.py:1227 Accelerator: AcceleratorType.ROCM
      INFO     vllm_probe_test:vllm_probe_test.py:1228 Architecture: ArchitectureType.X86
      INFO     vllm_probe_test:vllm_probe_test.py:1244 Running 1 stability test(s)...
      INFO     vllm_probe_test:vllm_probe_test.py:1248    Testing: stability_loop
      INFO     vllm.entrypoints.utils:utils.py:253 non-default args: {'max_model_len': 256, 'gpu_memory_utilization': 0.5, 'disable_log_stats': True, 'enforce_eager': True}
      ERROR    vllm_probe_test:vllm_probe_test.py:1255       ✗ FAILED: Device string must not be empty
      INFO     vllm_probe_test:vllm_probe_test.py:1260 
      INFO     vllm_probe_test:vllm_probe_test.py:1261 Test Results: 0/1 passed
      INFO     vllm_probe_test:vllm_probe_test.py:1262 ======================================================================
      ==================================================================== warnings summary =====================================================================
      probe-tests/probe-vllm/vllm_probe_test.py::TestVLLMCore::test_import_and_abi_verification
        <frozen importlib._bootstrap>:488: DeprecationWarning: builtin type SwigPyPacked has no __module__ attribute
      
      probe-tests/probe-vllm/vllm_probe_test.py::TestVLLMCore::test_import_and_abi_verification
        <frozen importlib._bootstrap>:488: DeprecationWarning: builtin type SwigPyObject has no __module__ attribute
      
      -- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
      ================================================================= short test summary info =================================================================
      FAILED probe-tests/probe-vllm/vllm_probe_test.py::TestVLLMCore::test_basic_inference - AssertionError: All inference tests failed
      FAILED probe-tests/probe-vllm/vllm_probe_test.py::TestVLLMStability::test_stability_loop - AssertionError: All stability tests failed: Device string must not be empty
      =================================================== 2 failed, 4 passed, 3 skipped, 2 warnings in 11.30s ===================================================
      sys:1: DeprecationWarning: builtin type swigvarlink has no __module__ attribute
      (.venv) (app-root) /opt/app-root/wheels-test$  
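
      The "Device string must not be empty" failure is consistent with the empty device type seen earlier being passed on as a device string. A probe could fail earlier and with a clearer message by checking the detected device type before constructing an LLM. The sketch below shows one way to do that; `require_device_type` is a hypothetical helper, not part of vLLM or the probe test suite.

```python
def require_device_type(device_type):
    """Return device_type unchanged, or raise if no platform was detected."""
    if not device_type:
        raise RuntimeError(
            "vLLM did not detect an accelerator platform (device_type is empty); "
            "check the ROCm runtime, device nodes, and installed wheels."
        )
    return device_type


# Intended usage inside a probe (requires vLLM, so it is left as a comment):
#   from vllm.platforms import current_platform
#   require_device_type(current_platform.device_type)
```

      With such a guard, the probe report would name the detection problem directly instead of surfacing a downstream device-string error.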

       

      Expected Result:
      The vLLM probe tests should pass without errors.

      Actual Result:
      The vLLM probe tests fail with errors (for example, "Device string must not be empty"), although they passed with older collections/releases.

              Assignee: Unassigned
              Reporter: Koushik Nagaraj (rh-ee-konagara)
              Frank's Team