RH Developer Hub Planning: RHDHPLAN-261

[Lightspeed] Evaluations - testing accuracy and efficacy across models



      Feature Overview (aka. Goal Summary)

      As we develop the Lightspeed plugin and make it more feature rich, one aspect we have not spent much time on is evaluating the accuracy of the model. Making use of evaluation tools and providing standard data for evaluation will help us standardize how we evaluate the accuracy of responses, and will help us and our users identify ways to improve accuracy.

      Leverage the common Lightspeed Core evaluation framework to ensure the Lightspeed core functionality and reference implementations maintain quality and compatibility across various Large Language Model (LLM) providers, making the evaluation process part of regular continuous integration (CI) and Quality Engineering (QE) testing.

      Goals (aka. expected user outcomes)

      Using the evaluation tools will help us to:

      1. Evaluate our areas of weakness to help us identify areas for improvement, e.g. documentation topics that we need to improve for RAG
      2. Evaluate models to help us identify the models that we recommend
      3. Provide a data set that standardizes the evaluation of the Lightspeed plugin
      4. Provide a baseline for the accuracy of our Lightspeed plugin using our recommended model(s)
      5. Help users evaluate their BYOM (bring your own model) and compare it with our recommended models using the evaluation tool and the standard data set
      6. Implement automated model evaluation against Lightspeed Core using the Lightspeed Core Evaluation Tool. The evaluation framework is designed to be flexible enough to configure different setups using configuration files such as system.yaml (for configuring the Judge LLM and API access) and eval_data.yaml (for conversation data and expected responses); a rough sketch of this configuration-driven approach follows this list.
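
      As a rough, non-authoritative sketch of the configuration-driven approach in item 6: the snippet below invents a minimal Q&A data file (the field names conversations, question, and expected_answer are assumptions made here for illustration; the authoritative schema is defined by the lightspeed-core/lightspeed-evaluation project) and shows how such a file could be written and read back for an evaluation run.

import yaml  # PyYAML

# Assumed, illustrative shape of an RHDH-specific eval data file; the real
# eval_data.yaml schema is owned by lightspeed-core/lightspeed-evaluation.
eval_data = {
    "conversations": [
        {
            "question": "How do I register a new component in Red Hat Developer Hub?",
            "expected_answer": "Import the component's catalog-info.yaml from its Git "
                               "repository using the Catalog import flow.",
        },
        {
            "question": "How do I enable a dynamic plugin in RHDH?",
            "expected_answer": "Add the plugin package to the dynamic plugins "
                               "configuration and restart the Developer Hub instance.",
        },
    ]
}

# Write the sample data set, then read it back the way an evaluation run would.
with open("eval_data.yaml", "w") as f:
    yaml.safe_dump(eval_data, f, sort_keys=False)

with open("eval_data.yaml") as f:
    loaded = yaml.safe_load(f)

for item in loaded["conversations"]:
    print(item["question"], "->", item["expected_answer"])

      Keeping the Q&A pairs in a version-controlled YAML file would make additions easy to review and would let us tag a data set version per RHDH release.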

      Requirements (aka. Acceptance Criteria):

      Lightspeed Core has developed Lightspeed evaluation tools (https://docs.google.com/presentation/d/1FWJta5h_GRpxJrOPSaoV_BNrx64VwSXFQv5wyRe-FVA/edit?slide=id.g37dcd7f6fa3_0_449#slide=id.g37dcd7f6fa3_0_449 and https://github.com/lightspeed-core/lightspeed-evaluation) that help Lightspeed products evaluate the accuracy of their responses. Other reference: OS evaluation framework: https://docs.google.com/presentation/d/1BHbksKcnpC5LbOxoH5nUT6yozz-djoOQJd53HWDkG6U/edit?slide=id.p#slide=id.p

      This work aligns with the development of a common Lightspeed Core evaluation framework (tracked by LCORE-56), intended to provide a standardized tool for a wide range of evaluations. The evaluation should focus on application-level evaluation (AI App Eval) rather than solely benchmarking.

      We will need to:

      1. Investigate how to run the Lightspeed evaluation tools with the Lightspeed plugin
      2. Create a data set (Q&A) specific to the RHDH Lightspeed plugin that can be fed into the evaluation tools to assess the accuracy of the responses
      3. Provide recommended models with benchmark accuracy numbers for running RHDH Lightspeed, based on the results of the Lightspeed evaluation with the standard data set
        1. 1 medium/large model for running in a cluster
        2. 1 small model for running locally
      4. Provide instructions for users on how to run the evaluation tools with the standard data set
      5. Stretch: Provide instructions for users to customize the data set to help them evaluate their model in the case of BYOK and BYO MCP in the future
      6. Add model evaluation to release QE testing. The testing scenarios should include the following (a rough sketch of such a test matrix appears after this list):
        1. Lightspeed Core integration with the Gemini API: evaluation against Lightspeed Core configured to integrate with the Gemini API. While Gemini's use in products requires testing/validation, the evaluation tool supports configurable API integration.
        2. Lightspeed Core integration with open-weight models: evaluation against Lightspeed Core configured with at least two popular open-weight LLMs, such as Llama, Mistral, or Gemma. The evaluation framework supports assessing compatibility with other small language models. Open-weight models like Llama and Mistral are currently being investigated and considered for Lightspeed integration.
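
      The sketch below shows one way (under stated assumptions) that such a provider/model matrix could be wired into automated QE runs with pytest. The run_evaluation helper, the provider/model labels, and the 0.8 accuracy threshold are hypothetical placeholders introduced here, not the lightspeed-evaluation tool's actual interface or agreed targets.

import pytest

# Hypothetical provider/model matrix for release QE testing; labels are placeholders.
PROVIDER_CONFIGS = [
    {"provider": "gemini", "model": "gemini"},
    {"provider": "open-weight", "model": "llama"},
    {"provider": "open-weight", "model": "mistral"},
]


def run_evaluation(provider: str, model: str) -> float:
    """Hypothetical wrapper: configure RHDH Lightspeed / Lightspeed Core for the
    given provider and model, run the evaluation tool with the standard data set,
    and return an aggregate accuracy score in [0.0, 1.0]."""
    raise NotImplementedError("Replace with the actual evaluation invocation.")


@pytest.mark.parametrize("config", PROVIDER_CONFIGS, ids=lambda c: c["model"])
def test_model_accuracy_meets_baseline(config):
    # 0.8 is an illustrative threshold, not an agreed baseline.
    accuracy = run_evaluation(config["provider"], config["model"])
    assert accuracy >= 0.8

      Parametrizing over the provider configuration keeps the Gemini and open-weight scenarios in a single test, and new models can be added to the matrix without changing the test logic.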

      Reports & Report Retention

      A standard report format should be defined as part of this work (a sketch of one possible report record follows).
      Reports should be generated in this standard format.
      Reports should be retained between RHDH releases so that we can perform analysis over time.
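
      As a starting point for that standard, each evaluation run could emit one machine-readable record like the sketch below. All field names are suggestions introduced for illustration, not an agreed schema; the actual format would be defined as part of this work.

import json
from datetime import datetime, timezone

# Sketch of a per-run report record; every field name is illustrative.
report = {
    "rhdh_release": "1.y.z",                       # placeholder release identifier
    "plugin": "lightspeed",
    "model_under_test": "example-model",           # placeholder model identifier
    "judge_model": "example-judge-model",          # placeholder judge LLM identifier
    "data_set_version": "rhdh-lightspeed-qa-v1",   # placeholder data set tag
    "metrics": {
        "answer_accuracy": 0.0,                    # filled in by the evaluation run
        "questions_evaluated": 0,
    },
    "generated_at": datetime.now(timezone.utc).isoformat(),
}

# One JSON document per run keeps reports easy to archive, diff, and aggregate
# across RHDH releases.
print(json.dumps(report, indent=2))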

      Out of Scope (Optional)

      High-level list of items that are out of scope.

      1. Data sets for other products, e.g. RHOAI, etc.
      2. Evaluation of questions/answers that are outside the scope of the recommended usage of RHDH Lightspeed

      Customer Considerations (Optional)

      Provide any additional customer-specific considerations that must be made
      when designing and delivering the Feature. Initial completion during
      Refinement status.

      1. Expandable to allow users to evaluate models that are tailored to their organization

      Documentation Considerations

      Provide information that needs to be considered and planned so that
      documentation will meet customer needs. If the feature extends existing
      functionality, provide a link to its current documentation.

      1. How to set up and run the evaluation tools with RHDH Lightspeed
      2. How and where to download the standard data set for RHDH Lightspeed evaluation
