• Type: Task
    • Resolution: Done
    • Priority: Major
    • Sprint: HAS Sprint 2264

      Task Description (Required)

      https://www.reddit.com/r/LocalLLaMA/comments/18g21af/vllm_vs_llamacpp/

      This thread discusses the differences between vLLM and llama.cpp.

       

      llama.cpp does better when a GPU or VRAM is lacking, but vLLM has better performance because it takes full advantage of the GPU.
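      For context on what adopting vLLM could look like, below is a minimal sketch of its offline Python API; the model name and sampling settings are placeholders, not decisions made in this issue.

      ```python
      # Minimal vLLM usage sketch; model name and sampling settings are placeholders.
      from vllm import LLM, SamplingParams

      llm = LLM(model="mistralai/Mistral-7B-Instruct-v0.2")  # placeholder model
      params = SamplingParams(temperature=0.7, max_tokens=256)

      # vLLM batches the prompts and schedules them on the GPU (continuous batching),
      # which is where its throughput advantage over llama.cpp comes from.
      outputs = llm.generate(
          ["Summarize what vLLM is.", "Write a Python hello world."],
          params,
      )
      for out in outputs:
          print(out.outputs[0].text)
      ```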

       

      The current software template uses llama.cpp, but the chatbot and codegen samples generate responses too slowly.

       

      This issue is to investigate whether vLLM would give better performance for the chatbot and codegen samples, and how feasible it would be to adopt it in the AI software template.
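      As a possible first step for the investigation, the same request could be timed against both backends, since the llama.cpp server and vLLM both expose an OpenAI-compatible chat completions endpoint. The URLs, model id, and prompt below are placeholders; this is only a rough sketch, not the final benchmark.

      ```python
      # Rough latency/throughput comparison sketch; endpoints and payload are placeholders.
      import time
      import requests

      ENDPOINTS = {
          "llama.cpp": "http://localhost:8080/v1/chat/completions",  # placeholder URL
          "vLLM": "http://localhost:8000/v1/chat/completions",       # placeholder URL
      }

      payload = {
          "model": "default",  # placeholder model id
          "messages": [{"role": "user",
                        "content": "Write a Python function that reverses a string."}],
          "max_tokens": 256,
      }

      for name, url in ENDPOINTS.items():
          start = time.time()
          resp = requests.post(url, json=payload, timeout=300)
          elapsed = time.time() - start
          usage = resp.json().get("usage", {})
          tokens = usage.get("completion_tokens") or 0
          rate = tokens / elapsed if elapsed > 0 else 0
          print(f"{name}: {elapsed:.1f}s total, ~{rate:.1f} tokens/s")
      ```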

       

      If this requires Change Management, complete the sections below:

      Change Request 

       

      <Select which item is being changed>

       

      [ ]  Add New Tokens

      [ ]  Rotate Tokens

      [ ]  Remove Tokens

      [ ] Others: (specify)

       

        Environment

      <Select which environment the change is being made on.  If both, open a separate issue so changes are tracked in each environment>

       

      [ ]  Stage OR

      [ ]  Prod

       

        Backout Plan

      <State what steps are needed to roll back in case something goes wrong>

       

        Downtime

      <Is there any downtime for these changes?  If so, for how long>

       

        Risk Level

      <How risky is this change?>

       

        Testing

      <How are changes verified?>

       

        Communication

      <How are service owners or consumers notified of these changes?>

              yangcao Stephanie Cao
              eyuen@redhat.com Elson Yuen
              RHIDP - AI
              Votes: 0
              Watchers: 1

                Created:
                Updated:
                Resolved: