    • Type: Task
    • Resolution: Done
    • Priority: Major
    • Component: Test Infrastructure
    • Story Points: 5
    • Sprint: RHDHPAI Sprint 3268

      Task Description

       

      We already have the granite3-dense:8b model (a.k.a. Granite-3.0-8B-Instruct) hosted on Ollama on our dev RHOAI cluster.

      This request is to host it with vLLM instead.

      The disadvantage of hosting models on Ollama is the model-switching cost. Every time someone makes an inference API call for a model that is not currently loaded into memory, llama.cpp/Ollama swaps out the old model and swaps in the new one. For larger models this swap can take a noticeable amount of time.
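
      That switching cost is easy to observe by timing a cold call (model not yet resident) against an immediate warm follow-up. A minimal sketch against the Ollama generate API, assuming the server is reachable at http://localhost:11434 (the host is a placeholder; only the granite3-dense:8b tag comes from this ticket):

      import time
      import requests

      OLLAMA_URL = "http://localhost:11434/api/generate"  # placeholder host; use the dev RHOAI cluster's Ollama route
      MODEL = "granite3-dense:8b"  # model tag from this ticket

      def timed_generate(prompt):
          # One non-streaming generate call; returns wall-clock latency in seconds.
          start = time.perf_counter()
          resp = requests.post(
              OLLAMA_URL,
              json={"model": MODEL, "prompt": prompt, "stream": False},
              timeout=600,
          )
          resp.raise_for_status()
          return time.perf_counter() - start

      # The first call may include the model swap-in/load time; the second should hit a warm model.
      print("cold call: %.1fs" % timed_generate("Say hi."))
      print("warm call: %.1fs" % timed_generate("Say hi again."))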

      Steps

       

      • Onboard a new node with a GPU. A smaller GPU should be sufficient for an 8B model.
      • Create a new vLLM model inference server for Granite-3.0-8B-Instruct.
      • Register/advertise the new inference server through the API gateway.
      • Test (see the smoke-test sketch after this list).
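
      For the test step, a quick smoke test through the API gateway against the OpenAI-compatible endpoint that vLLM exposes should be enough. A minimal sketch, assuming a placeholder gateway route, served model name, and token (the real values come from the registration step above):

      import requests

      # Placeholders: substitute the real gateway route, served model name, and credentials
      # once the inference server is registered.
      GATEWAY_URL = "https://api-gateway.example.com/v1/chat/completions"
      MODEL_ID = "granite-3.0-8b-instruct"
      TOKEN = "changeme"

      resp = requests.post(
          GATEWAY_URL,
          headers={"Authorization": "Bearer " + TOKEN},
          json={
              "model": MODEL_ID,
              "messages": [{"role": "user", "content": "Reply with the single word: pong"}],
              "max_tokens": 10,
          },
          timeout=120,
      )
      resp.raise_for_status()
      print(resp.json()["choices"][0]["message"]["content"])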

       

       

      Assignee: John Collier (johnmcollier)
      Reporter: John Collier (johnmcollier)
      RHIDP - AI