Red Hat Enterprise Linux AI / RHELAI-4263

[1.4.4 Nvidia] Error during ilab model evaluate


    • Type: Bug
    • Resolution: Obsolete
    • Priority: Critical
    • rhelai-1.4.4

      To Reproduce

      Steps to reproduce the behavior:

      1. Run single-phase training: ilab model train -y --data-path <messages jsonl file> --is-padding-free False
      2. Run ilab model evaluate on an arbitrary checkpoint produced by that training run. The certification test suite ran: ilab model evaluate --benchmark mmlu --model /root/.local/share/instructlab/checkpoints/hf_format/samples_8646 --enable-serving-output
      3. The evaluation fails with: ValueError: Out of range float values are not JSON compliant
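
      The message in step 3 is the one Python's standard json module raises when it is asked to serialize NaN or an infinite float with allow_nan=False, which is how Starlette's JSONResponse calls json.dumps. The following is a minimal sketch of that failure mode, independent of ilab and vLLM. The payload shape and the field name "logprob" are hypothetical; this issue does not show which field held the non-finite value.

```python
import json

# Hypothetical payload: the real response body is not shown in this issue.
payload = {"choices": [{"logprob": float("nan")}]}


def try_dump(obj):
    """Serialize strictly; return (ok, json_text_or_error_message)."""
    try:
        return True, json.dumps(obj, allow_nan=False)
    except ValueError as exc:
        return False, str(exc)


ok, message = try_dump(payload)
print(ok, message)

# With the default allow_nan=True, json.dumps emits the non-standard
# token NaN instead of raising.
lenient = json.dumps(payload)
print(lenient)
```

      The strict call fails for NaN, inf and -inf alike, while finite floats serialize normally.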

      Expected behavior

      • The evaluation should complete without errors and the test should succeed

      Device Info (please complete the following information):

      • Hardware Specs: 4x NVIDIA L40S
      • OS Version: registry.redhat.io/rhelai1/bootc-nvidia-rhel9:1.4, Version: 9.20250220.0, RHEL AI 1.4.4
      • InstructLab Version: 0.23.5
      • Output of ilab system info:

          ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
          ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
          ggml_cuda_init: found 4 CUDA devices:
            Device 0: NVIDIA L40S, compute capability 8.9, VMM: yes
            Device 1: NVIDIA L40S, compute capability 8.9, VMM: yes
            Device 2: NVIDIA L40S, compute capability 8.9, VMM: yes
            Device 3: NVIDIA L40S, compute capability 8.9, VMM: yes
          Platform:
            sys.version: 3.11.7 (main, Jan  8 2025, 00:00:00) [GCC 11.4.1 20231218 (Red Hat 11.4.1-3)]
            sys.platform: linux
            os.name: posix
            platform.release: 5.14.0-427.55.1.el9_4.x86_64
            platform.machine: x86_64
            platform.node: localhost.localdomain
            platform.python_version: 3.11.7
            os-release.ID: rhel
            os-release.VERSION_ID: 9.4
            os-release.PRETTY_NAME: Red Hat Enterprise Linux 9.4 (Plow)
            memory.total: 3023.64 GB
            memory.available: 3007.63 GB
            memory.used: 4.18 GB
          InstructLab:
            instructlab.version: 0.23.5
            instructlab-dolomite.version: 0.2.0
            instructlab-eval.version: 0.5.1
            instructlab-quantize.version: 0.1.0
            instructlab-schema.version: 0.4.2
            instructlab-sdg.version: 0.7.3
            instructlab-training.version: 0.7.0
          Torch:
            torch.version: 2.5.1
            torch.backends.cpu.capability: AVX512
            torch.version.cuda: 12.4
            torch.version.hip: None
            torch.cuda.available: True
            torch.backends.cuda.is_built: True
            torch.backends.mps.is_built: False
            torch.backends.mps.is_available: False
            torch.cuda.bf16: True
            torch.cuda.current.device: 0
            torch.cuda.0.name: NVIDIA L40S
            torch.cuda.0.free: 43.9 GB
            torch.cuda.0.total: 44.3 GB
            torch.cuda.0.capability: 8.9 (see https://developer.nvidia.com/cuda-gpus#compute)
            torch.cuda.1.name: NVIDIA L40S
            torch.cuda.1.free: 43.9 GB
            torch.cuda.1.total: 44.3 GB
            torch.cuda.1.capability: 8.9 (see https://developer.nvidia.com/cuda-gpus#compute)
            torch.cuda.2.name: NVIDIA L40S
            torch.cuda.2.free: 43.9 GB
            torch.cuda.2.total: 44.3 GB
            torch.cuda.2.capability: 8.9 (see https://developer.nvidia.com/cuda-gpus#compute)
            torch.cuda.3.name: NVIDIA L40S
            torch.cuda.3.free: 43.9 GB
            torch.cuda.3.total: 44.3 GB
            torch.cuda.3.capability: 8.9 (see https://developer.nvidia.com/cuda-gpus#compute)
          llama_cpp_python:
            llama_cpp_python.version: 0.3.2
            llama_cpp_python.supports_gpu_offload: True
          

      Error

       

      INFO 05-26 21:06:51 engine.py:267] Added request cmpl-2a9f02652b8d4922a6bd3aff5a4e798b-0.
      INFO:     127.0.0.1:45666 - "POST /v1/completions HTTP/1.1" 500 Internal Server Error
      ERROR:    Exception in ASGI application
        + Exception Group Traceback (most recent call last):
        |   File "/opt/app-root/lib64/python3.11/site-packages/starlette/_utils.py", line 76, in collapse_excgroups
        |     yield
        |   File "/opt/app-root/lib64/python3.11/site-packages/starlette/middleware/base.py", line 178, in __call__
        |     async with anyio.create_task_group() as task_group:
        |   File "/opt/app-root/lib64/python3.11/site-packages/anyio/_backends/_asyncio.py", line 767, in __aexit__
        |     raise BaseExceptionGroup(
        | ExceptionGroup: unhandled errors in a TaskGroup (1 sub-exception)
        +-+---------------- 1 ----------------
          | Traceback (most recent call last):
          |   File "/opt/app-root/lib64/python3.11/site-packages/uvicorn/protocols/http/httptools_impl.py", line 409, in run_asgi
          |     result = await app(  # type: ignore[func-returns-value]
          |              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
          |   File "/opt/app-root/lib64/python3.11/site-packages/uvicorn/middleware/proxy_headers.py", line 60, in __call__
          |     return await self.app(scope, receive, send)
          |            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
          |   File "/opt/app-root/lib64/python3.11/site-packages/fastapi/applications.py", line 1054, in __call__
          |     await super().__call__(scope, receive, send)
          |   File "/opt/app-root/lib64/python3.11/site-packages/starlette/applications.py", line 112, in __call__
          |     await self.middleware_stack(scope, receive, send)
          |   File "/opt/app-root/lib64/python3.11/site-packages/starlette/middleware/errors.py", line 187, in __call__
          |     raise exc
          |   File "/opt/app-root/lib64/python3.11/site-packages/starlette/middleware/errors.py", line 165, in __call__
          |     await self.app(scope, receive, _send)
          |   File "/opt/app-root/lib64/python3.11/site-packages/starlette/middleware/base.py", line 177, in __call__
          |     with recv_stream, send_stream, collapse_excgroups():
          |   File "/usr/lib64/python3.11/contextlib.py", line 158, in __exit__
          |     self.gen.throw(typ, value, traceback)
          |   File "/opt/app-root/lib64/python3.11/site-packages/starlette/_utils.py", line 82, in collapse_excgroups
          |     raise exc
          |   File "/opt/app-root/lib64/python3.11/site-packages/starlette/middleware/base.py", line 179, in __call__
          |     response = await self.dispatch_func(request, call_next)
          |                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
          |   File "/opt/app-root/lib64/python3.11/site-packages/vllm/entrypoints/openai/api_server.py", line 490, in add_request_id
          |     response = await call_next(request)
          |                ^^^^^^^^^^^^^^^^^^^^^^^^
          |   File "/opt/app-root/lib64/python3.11/site-packages/starlette/middleware/base.py", line 154, in call_next
          |     raise app_exc
          |   File "/opt/app-root/lib64/python3.11/site-packages/starlette/middleware/base.py", line 141, in coro
          |     await self.app(scope, receive_or_disconnect, send_no_error)
          |   File "/opt/app-root/lib64/python3.11/site-packages/starlette/middleware/cors.py", line 85, in __call__
          |     await self.app(scope, receive, send)
          |   File "/opt/app-root/lib64/python3.11/site-packages/starlette/middleware/exceptions.py", line 62, in __call__
          |     await wrap_app_handling_exceptions(self.app, conn)(scope, receive, send)
          |   File "/opt/app-root/lib64/python3.11/site-packages/starlette/_exception_handler.py", line 53, in wrapped_app
          |     raise exc
          |   File "/opt/app-root/lib64/python3.11/site-packages/starlette/_exception_handler.py", line 42, in wrapped_app
          |     await app(scope, receive, sender)
          |   File "/opt/app-root/lib64/python3.11/site-packages/starlette/routing.py", line 715, in __call__
          |     await self.middleware_stack(scope, receive, send)
          |   File "/opt/app-root/lib64/python3.11/site-packages/starlette/routing.py", line 735, in app
          |     await route.handle(scope, receive, send)
          |   File "/opt/app-root/lib64/python3.11/site-packages/starlette/routing.py", line 288, in handle
          |     await self.app(scope, receive, send)
          |   File "/opt/app-root/lib64/python3.11/site-packages/starlette/routing.py", line 76, in app
          |     await wrap_app_handling_exceptions(app, request)(scope, receive, send)
          |   File "/opt/app-root/lib64/python3.11/site-packages/starlette/_exception_handler.py", line 53, in wrapped_app
          |     raise exc
          |   File "/opt/app-root/lib64/python3.11/site-packages/starlette/_exception_handler.py", line 42, in wrapped_app
          |     await app(scope, receive, sender)
          |   File "/opt/app-root/lib64/python3.11/site-packages/starlette/routing.py", line 73, in app
          |     response = await f(request)
          |                ^^^^^^^^^^^^^^^^
          |   File "/opt/app-root/lib64/python3.11/site-packages/fastapi/routing.py", line 301, in app
          |     raw_response = await run_endpoint_function(
          |                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
          |   File "/opt/app-root/lib64/python3.11/site-packages/fastapi/routing.py", line 212, in run_endpoint_function
          |     return await dependant.call(**values)
          |            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
          |   File "/opt/app-root/lib64/python3.11/site-packages/vllm/entrypoints/openai/api_server.py", line 371, in create_completion
          |     return JSONResponse(content=generator.model_dump())
          |            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
          |   File "/opt/app-root/lib64/python3.11/site-packages/starlette/responses.py", line 181, in __init__
          |     super().__init__(content, status_code, headers, media_type, background)
          |   File "/opt/app-root/lib64/python3.11/site-packages/starlette/responses.py", line 44, in __init__
          |     self.body = self.render(content)
          |                 ^^^^^^^^^^^^^^^^^^^^
          |   File "/opt/app-root/lib64/python3.11/site-packages/starlette/responses.py", line 184, in render
          |     return json.dumps(
          |            ^^^^^^^^^^^
          |   File "/usr/lib64/python3.11/json/__init__.py", line 238, in dumps
          |     **kw).encode(obj)
          |           ^^^^^^^^^^^
          |   File "/usr/lib64/python3.11/json/encoder.py", line 200, in encode
          |     chunks = self.iterencode(o, _one_shot=True)
          |              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
          |   File "/usr/lib64/python3.11/json/encoder.py", line 258, in iterencode
          |     return _iterencode(o, 0)
          |            ^^^^^^^^^^^^^^^^^
          | ValueError: Out of range float values are not JSON compliant
          +------------------------------------
      During handling of the above exception, another exception occurred:
      Traceback (most recent call last):
        File "/opt/app-root/lib64/python3.11/site-packages/uvicorn/protocols/http/httptools_impl.py", line 409, in run_asgi
          result = await app(  # type: ignore[func-returns-value]
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
        File "/opt/app-root/lib64/python3.11/site-packages/uvicorn/middleware/proxy_headers.py", line 60, in __call__
          return await self.app(scope, receive, send)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
        File "/opt/app-root/lib64/python3.11/site-packages/fastapi/applications.py", line 1054, in __call__
          await super().__call__(scope, receive, send)
        File "/opt/app-root/lib64/python3.11/site-packages/starlette/applications.py", line 112, in __call__
          await self.middleware_stack(scope, receive, send)
        File "/opt/app-root/lib64/python3.11/site-packages/starlette/middleware/errors.py", line 187, in __call__
          raise exc
        File "/opt/app-root/lib64/python3.11/site-packages/starlette/middleware/errors.py", line 165, in __call__
          await self.app(scope, receive, _send)
        File "/opt/app-root/lib64/python3.11/site-packages/starlette/middleware/base.py", line 177, in __call__
          with recv_stream, send_stream, collapse_excgroups():
        File "/usr/lib64/python3.11/contextlib.py", line 158, in __exit__
          self.gen.throw(typ, value, traceback)
        File "/opt/app-root/lib64/python3.11/site-packages/starlette/_utils.py", line 82, in collapse_excgroups
          raise exc
        File "/opt/app-root/lib64/python3.11/site-packages/starlette/middleware/base.py", line 179, in __call__
          response = await self.dispatch_func(request, call_next)
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
        File "/opt/app-root/lib64/python3.11/site-packages/vllm/entrypoints/openai/api_server.py", line 490, in add_request_id
          response = await call_next(request)
                     ^^^^^^^^^^^^^^^^^^^^^^^^
        File "/opt/app-root/lib64/python3.11/site-packages/starlette/middleware/base.py", line 154, in call_next
          raise app_exc
        File "/opt/app-root/lib64/python3.11/site-packages/starlette/middleware/base.py", line 141, in coro
          await self.app(scope, receive_or_disconnect, send_no_error)
        File "/opt/app-root/lib64/python3.11/site-packages/starlette/middleware/cors.py", line 85, in __call__
          await self.app(scope, receive, send)
        File "/opt/app-root/lib64/python3.11/site-packages/starlette/middleware/exceptions.py", line 62, in __call__
          await wrap_app_handling_exceptions(self.app, conn)(scope, receive, send)
        File "/opt/app-root/lib64/python3.11/site-packages/starlette/_exception_handler.py", line 53, in wrapped_app
          raise exc
        File "/opt/app-root/lib64/python3.11/site-packages/starlette/_exception_handler.py", line 42, in wrapped_app
          await app(scope, receive, sender)
        File "/opt/app-root/lib64/python3.11/site-packages/starlette/routing.py", line 715, in __call__
          await self.middleware_stack(scope, receive, send)
        File "/opt/app-root/lib64/python3.11/site-packages/starlette/routing.py", line 735, in app
          await route.handle(scope, receive, send)
        File "/opt/app-root/lib64/python3.11/site-packages/starlette/routing.py", line 288, in handle
          await self.app(scope, receive, send)
        File "/opt/app-root/lib64/python3.11/site-packages/starlette/routing.py", line 76, in app
          await wrap_app_handling_exceptions(app, request)(scope, receive, send)
        File "/opt/app-root/lib64/python3.11/site-packages/starlette/_exception_handler.py", line 53, in wrapped_app
          raise exc
        File "/opt/app-root/lib64/python3.11/site-packages/starlette/_exception_handler.py", line 42, in wrapped_app
          await app(scope, receive, sender)
        File "/opt/app-root/lib64/python3.11/site-packages/starlette/routing.py", line 73, in app
          response = await f(request)
                     ^^^^^^^^^^^^^^^^
        File "/opt/app-root/lib64/python3.11/site-packages/fastapi/routing.py", line 301, in app
          raw_response = await run_endpoint_function(
                         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
        File "/opt/app-root/lib64/python3.11/site-packages/fastapi/routing.py", line 212, in run_endpoint_function
          return await dependant.call(**values)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
        File "/opt/app-root/lib64/python3.11/site-packages/vllm/entrypoints/openai/api_server.py", line 371, in create_completion
          return JSONResponse(content=generator.model_dump())
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
        File "/opt/app-root/lib64/python3.11/site-packages/starlette/responses.py", line 181, in __init__
          super().__init__(content, status_code, headers, media_type, background)
        File "/opt/app-root/lib64/python3.11/site-packages/starlette/responses.py", line 44, in __init__
          self.body = self.render(content)
                      ^^^^^^^^^^^^^^^^^^^^
        File "/opt/app-root/lib64/python3.11/site-packages/starlette/responses.py", line 184, in render
          return json.dumps(
                 ^^^^^^^^^^^
        File "/usr/lib64/python3.11/json/__init__.py", line 238, in dumps
          **kw).encode(obj)
                ^^^^^^^^^^^
        File "/usr/lib64/python3.11/json/encoder.py", line 200, in encode
          chunks = self.iterencode(o, _one_shot=True)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
        File "/usr/lib64/python3.11/json/encoder.py", line 258, in iterencode
          return _iterencode(o, 0)
                 ^^^^^^^^^^^^^^^^^
      ValueError: Out of range float values are not JSON compliant
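
      The final frames show json.dumps failing inside Starlette's JSONResponse while rendering the result of generator.model_dump(), so at least one float in the completion response was NaN or infinite. The traceback does not say which field. Below is a hedged diagnostic sketch for narrowing that down. The helper find_non_finite and the sample dictionary are hypothetical and are not part of vLLM or InstructLab; the helper only walks a plain dict/list structure, such as a dumped response, and reports where non-finite floats sit.

```python
import math


def find_non_finite(obj, path="$"):
    """Yield (path, value) for every NaN or infinite float nested in obj."""
    if isinstance(obj, float) and not math.isfinite(obj):
        yield path, obj
    elif isinstance(obj, dict):
        for key, value in obj.items():
            yield from find_non_finite(value, f"{path}.{key}")
    elif isinstance(obj, (list, tuple)):
        for index, value in enumerate(obj):
            yield from find_non_finite(value, f"{path}[{index}]")


# Hypothetical response body, shaped loosely like a completions result.
sample = {
    "choices": [
        {
            "text": "A",
            "logprobs": {"token_logprobs": [-0.25, float("-inf"), float("nan")]},
        }
    ]
}

hits = [path for path, _ in find_non_finite(sample)]
print(hits)
```

      If the reported paths turn out to be log-probability fields, a checkpoint that produces non-finite scores would be one possible cause, but this issue does not establish that.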
      

       

      Bug impact

      • Certification for partner hardware is pending

      Known workaround

      • None yet

      Attachments

        1. evaluation_mmlu (2).log (5.73 MB)
        2. SDG (3).log (168 kB)
        3. single_phase_train (4).log (81 kB)

      Assignee: Unassigned
      Reporter: Aman Turate (rh-ee-aturate)