Red Hat Enterprise Linux AI / RHELAI-2558

RHELAI 1.3 Intel: Chat response with llama2/granite stops in the middle of a phrase.


    • Critical
    • Approved

      To Reproduce

      Steps to reproduce the behavior:

      1. On Intel Gaudi, with the serve GPU count and tensor-parallel parameters set to 1, run `ilab model chat`
      2. Ask a question
      3. Observe that the response is cut off mid-sentence:

         >>> From what is composed the water ?                                                                              [S][default]
        ╭────────────────────────────────────────────────────────────────── granite-7b-redhat-lab ───────────────────────────────────────────────────────────────────╮
        │ Water is a fascinating substance, and it is primarily composed of two elements: hydrogen and o                                                             │
        ╰────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────── elapsed 201.699 seconds ─╯

      >>>                                                                                                                                               

      4. The same behavior occurs for the input `rm -rf /`:

         >>> rm -rf /                                                                             [S][default]
        ╭────────────────────────────────────── granite-7b-redhat-lab ──────────────────────────────────────╮
        │ ```                                                                                               │
        ╰───────────────────────────────────────────────────────────────────────── elapsed 118.725 seconds ─╯

      Running `serve` and `chat` in separate tmux sessions shows the same behavior:

      >>> Fromm what is water composed ?                                                       [S][default]
      ╭────────────────────────────────────── granite-7b-redhat-lab ──────────────────────────────────────╮
      │ W                                                                                                 │
      ╰─────────────────────────────────────────────────────────────────────────── elapsed 6.009 seconds ─╯
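
      To separate the `ilab model chat` client from the server, a minimal check (a sketch, not part of the original report) is to send the same question straight to the vLLM OpenAI-compatible endpoint with an explicit `max_tokens` and look at the returned `finish_reason`. The base URL, port, and model name below are assumptions; adjust them to whatever `ilab model serve` reports.

          # Hedged sketch: query the served model directly, bypassing the chat TUI.
          # URL, port, and model name are assumptions, not values from the report.
          import requests

          resp = requests.post(
              "http://127.0.0.1:8000/v1/chat/completions",   # assumed ilab serve address
              json={
                  "model": "granite-7b-redhat-lab",          # assumed served model name
                  "messages": [
                      {"role": "user", "content": "From what is composed the water ?"},
                  ],
                  "max_tokens": 512,  # explicit cap, to rule out a client-side default
              },
              timeout=300,
          )
          resp.raise_for_status()
          choice = resp.json()["choices"][0]
          # "stop" means EOS or a stop string was hit; "length" means the token cap was hit.
          print(choice["finish_reason"])
          print(choice["message"]["content"])

      If the answer comes back complete here, the truncation is more likely happening in the chat client or its streaming path rather than in vLLM itself.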

      serve log:

      INFO 12-05 14:05:37 logger.py:36] Received request chat-e12563e5bb954b059a0b657313104c5d: prompt: '<|system|>\nI am, Red Hat® Instruct Model based on Granite 7B, an AI language model developed by Red Hat and IBM Research, based on the Granite-7b-base language model. My primary function is to be a chat assistant.\n<|user|>\nFromm what is water composed ?\n<|user|>\nFromm what is water composed ?\n<|assistant|>\n', params: SamplingParams(n=1, best_of=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=1.0, top_p=1.0, top_k=-1, min_p=0.0, seed=None, use_beam_search=False, length_penalty=1.0, early_stopping=False, stop=[], stop_token_ids=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=None, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=None), prompt_token_ids: [32003, 29871, 13, 29902, 626, 29892, 4367, 25966, 30342, 2799, 1247, 8125, 2729, 373, 6274, 568, 29871, 29955, 29933, 29892, 385, 319, 29902, 4086, 1904, 8906, 491, 4367, 25966, 322, 27955, 10550, 29892, 2729, 373, 278, 6274, 568, 29899, 29955, 29890, 29899, 3188, 4086, 1904, 29889, 1619, 7601, 740, 338, 304, 367, 263, 13563, 20255, 29889, 13, 32004, 29871, 13, 4591, 29885, 825, 338, 4094, 13725, 1577, 13, 32004, 29871, 13, 4591, 29885, 825, 338, 4094, 13725, 1577, 13, 32005, 29871, 13], lora_request: None, prompt_adapter_request: None.
      INFO:     127.0.0.1:41458 - "POST /v1/chat/completions HTTP/1.1" 200 OK
      INFO 12-05 14:05:37 async_llm_engine.py:173] Added request chat-e12563e5bb954b059a0b657313104c5d.
      INFO 12-05 14:05:37 metrics.py:406] Avg prompt throughput: 16.4 tokens/s, Avg generation throughput: 5.6 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.2%, CPU KV cache usage: 1.6%.
      INFO 12-05 14:05:42 metrics.py:406] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 44.4 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.4%, CPU KV cache usage: 1.6%.
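
      The logged SamplingParams show `max_tokens=None` and `stop=[]`, and the rendered prompt contains the user turn twice ("Fromm what is water composed ?" appears two times before `<|assistant|>`). As a further hedged check (again an assumption-laden sketch, not part of the original report), the snippet below streams the same request and records the last `finish_reason`, to see whether the stream itself ends early or only the chat client stops rendering it. Endpoint, port, and model name are assumptions.

          # Hedged sketch: stream the request and count what actually arrives.
          import json
          import requests

          with requests.post(
              "http://127.0.0.1:8000/v1/chat/completions",   # assumed ilab serve address
              json={
                  "model": "granite-7b-redhat-lab",          # assumed served model name
                  "messages": [{"role": "user", "content": "Fromm what is water composed ?"}],
                  "stream": True,
              },
              stream=True,
              timeout=300,
          ) as resp:
              resp.raise_for_status()
              text, finish_reason = "", None
              for line in resp.iter_lines():
                  # Server-sent events: payload lines look like "data: {...}".
                  if not line or not line.startswith(b"data: "):
                      continue
                  payload = line[len(b"data: "):]
                  if payload == b"[DONE]":
                      break
                  choice = json.loads(payload)["choices"][0]
                  text += choice["delta"].get("content") or ""
                  finish_reason = choice.get("finish_reason") or finish_reason
              print("finish_reason:", finish_reason)
              print("received", len(text), "characters")

      Comparing the character count and `finish_reason` here with what the chat box shows (for example "hydrogen and o" after 201.699 seconds) should indicate whether tokens are generated but lost, or never generated at all.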

      Expected behavior

      • The chat response is returned in full instead of stopping in the middle of a phrase (or after a single character).

      Screenshots

      • Attached Image 

      Device Info (please complete the following information):

      • Hardware Specs: 8× Intel Gaudi 3 accelerators, 288 CPU cores, ~2.2 TB RAM (see `ilab system info` output below)
      • OS Version: Red Hat Enterprise Linux 9.4 (Plow), kernel 5.14.0-427.42.1.el9_4.x86_64
      • InstructLab Version: 0.21.0 (output of `ilab --version`)
      • Provide the output of these two commands:
        • [root@localhost ~]# bootc status --format json | jq .status.booted.image.image.image
          "registry.redhat.io/rhelai1/bootc-intel-rhel9:1.3-1733319681"
        • ilab system info to print detailed information about InstructLab version, OS, and hardware, including GPU / AI accelerator hardware:
          [root@localhost ~]# ilab system info
          /usr/lib64/python3.11/inspect.py:389: FutureWarning: `torch.distributed.reduce_op` is deprecated, please use `torch.distributed.ReduceOp` instead
            return isinstance(object, types.FunctionType)
          ============================= HABANA PT BRIDGE CONFIGURATION =========================== 
           PT_HPU_LAZY_MODE = 1
           PT_RECIPE_CACHE_PATH = 
           PT_CACHE_FOLDER_DELETE = 0
           PT_HPU_RECIPE_CACHE_CONFIG = 
           PT_HPU_MAX_COMPOUND_OP_SIZE = 9223372036854775807
           PT_HPU_LAZY_ACC_PAR_MODE = 1
           PT_HPU_ENABLE_REFINE_DYNAMIC_SHAPES = 0
           PT_HPU_EAGER_PIPELINE_ENABLE = 1
           PT_HPU_EAGER_COLLECTIVE_PIPELINE_ENABLE = 1
          ---------------------------: System Configuration :---------------------------
          Num CPU Cores : 288
          CPU RAM       : -1919526024 KB
          ------------------------------------------------------------------------------
          Platform:
            sys.version: 3.11.7 (main, Oct  9 2024, 00:00:00) [GCC 11.4.1 20231218 (Red Hat 11.4.1-3)]
            sys.platform: linux
            os.name: posix
            platform.release: 5.14.0-427.42.1.el9_4.x86_64
            platform.machine: x86_64
            platform.node: localhost
            platform.python_version: 3.11.7
            os-release.ID: rhel
            os-release.VERSION_ID: 9.4
            os-release.PRETTY_NAME: Red Hat Enterprise Linux 9.4 (Plow)
            memory.total: 2265.40 GB
            memory.available: 2246.60 GB
            memory.used: 10.00 GB
          InstructLab:
            instructlab.version: 0.21.0
            instructlab-dolomite.version: 0.2.0
            instructlab-eval.version: 0.4.1
            instructlab-quantize.version: 0.1.0
            instructlab-schema.version: 0.4.1
            instructlab-sdg.version: 0.6.1
            instructlab-training.version: 0.6.1
          Torch:
            torch.version: 2.4.0a0+git74cd574
            torch.backends.cpu.capability: AVX512
            torch.version.cuda: None
            torch.version.hip: None
            torch.cuda.available: False
            torch.backends.cuda.is_built: False
            torch.backends.mps.is_built: False
            torch.backends.mps.is_available: False
            habana_torch_plugin.version: 1.18.0.524
            torch.hpu.is_available: True
            torch.hpu.device_count: 8
            torch.hpu.0.name: GAUDI3
            torch.hpu.0.capability: 1.18.0.1b7f293
            torch.hpu.0.properties: sramBaseAddress=144396662951903232, dramBaseAddress=144396800491520000, sramSize=0, dramSize=136465870848, tpcEnabledMask=18446744073709551615, dramEnabled=1, fd=18, device_id=0, device_type=5
            torch.hpu.1.name: GAUDI3
            torch.hpu.1.capability: 1.18.0.1b7f293
            torch.hpu.1.properties: sramBaseAddress=144396662951903232, dramBaseAddress=144396800491520000, sramSize=0, dramSize=136465870848, tpcEnabledMask=18446744073709551615, dramEnabled=1, fd=18, device_id=0, device_type=5
            torch.hpu.2.name: GAUDI3
            torch.hpu.2.capability: 1.18.0.1b7f293
            torch.hpu.2.properties: sramBaseAddress=144396662951903232, dramBaseAddress=144396800491520000, sramSize=0, dramSize=136465870848, tpcEnabledMask=18446744073709551615, dramEnabled=1, fd=18, device_id=0, device_type=5
            torch.hpu.3.name: GAUDI3
            torch.hpu.3.capability: 1.18.0.1b7f293
            torch.hpu.3.properties: sramBaseAddress=144396662951903232, dramBaseAddress=144396800491520000, sramSize=0, dramSize=136465870848, tpcEnabledMask=18446744073709551615, dramEnabled=1, fd=18, device_id=0, device_type=5
            torch.hpu.4.name: GAUDI3
            torch.hpu.4.capability: 1.18.0.1b7f293
            torch.hpu.4.properties: sramBaseAddress=144396662951903232, dramBaseAddress=144396800491520000, sramSize=0, dramSize=136465870848, tpcEnabledMask=18446744073709551615, dramEnabled=1, fd=18, device_id=0, device_type=5
            torch.hpu.5.name: GAUDI3
            torch.hpu.5.capability: 1.18.0.1b7f293
            torch.hpu.5.properties: sramBaseAddress=144396662951903232, dramBaseAddress=144396800491520000, sramSize=0, dramSize=136465870848, tpcEnabledMask=18446744073709551615, dramEnabled=1, fd=18, device_id=0, device_type=5
            torch.hpu.6.name: GAUDI3
            torch.hpu.6.capability: 1.18.0.1b7f293
            torch.hpu.6.properties: sramBaseAddress=144396662951903232, dramBaseAddress=144396800491520000, sramSize=0, dramSize=136465870848, tpcEnabledMask=18446744073709551615, dramEnabled=1, fd=18, device_id=0, device_type=5
            torch.hpu.7.name: GAUDI3
            torch.hpu.7.capability: 1.18.0.1b7f293
            torch.hpu.7.properties: sramBaseAddress=144396662951903232, dramBaseAddress=144396800491520000, sramSize=0, dramSize=136465870848, tpcEnabledMask=18446744073709551615, dramEnabled=1, fd=18, device_id=0, device_type=5
            env.HABANA_LOGS: /var/log/habana_logs/
            env.HABANA_PLUGINS_LIB_PATH: /opt/habanalabs/habana_plugins
            env.HABANA_PROFILE: profile_api_light
            env.HABANA_SCAL_BIN_PATH: /opt/habanalabs/engines_fw
          llama_cpp_python:
            llama_cpp_python.version: 0.2.79
            llama_cpp_python.supports_gpu_offload: False
          [root@localhost ~]# 

