RH Developer Hub Planning: RHDHPLAN-261

[Lightspeed] Evaluations - testing accuracy and efficacy across models



      Feature Overview (aka. Goal Summary)

      As we develop the Lightspeed plugin and make it more feature rich, one aspect we have not spent much time on is evaluating the accuracy of the model. Making use of evaluation tools and providing standard data for evaluation will help us standardize how we evaluate the accuracy of responses, and will help us and our users identify ways to improve accuracy.

      Leverage the common Lightspeed Core evaluation framework to ensure the Lightspeed core functionality and reference implementations maintain quality and compatibility across various Large Language Model (LLM) providers, making the evaluation process part of regular continuous integration (CI) and Quality Engineering (QE) testing.

      Goals (aka. expected user outcomes)

      Using the evaluation tools will help us to:

      1. Evaluate our areas of weakness to help us identify areas for improvement, e.g. documentation topics that we need to improve for RAG
      2. Evaluate models to help us identify the models that we recommend
      3. Provide a data set that standardizes the evaluation of the Lightspeed plugin
      4. Provide a baseline for the accuracy of our Lightspeed plugin using our recommended model(s)
      5. Help users evaluate their BYOM (bring your own model) and compare it with our recommended models using the evaluation tool and the standard data set
      6. Implement automated model evaluation against Lightspeed Core using the Lightspeed Core Evaluation Tool. The evaluation framework is designed to be flexible enough to configure different setups using configuration files such as system.yaml (for configuring the Judge LLM and API access) and eval_data.yaml (for conversation data and expected responses); a rough sketch of this configuration-driven approach follows this list.
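
      As a rough, non-authoritative sketch of the configuration-driven approach in item 6: the snippet below invents a minimal Q&A data file (the field names conversations, question, and expected_answer are assumptions made here for illustration; the authoritative schema is defined by the lightspeed-core/lightspeed-evaluation project) and shows how such a file could be written and read back for an evaluation run.

import yaml  # PyYAML

# Assumed, illustrative shape of an RHDH-specific eval data file; the real
# eval_data.yaml schema is owned by lightspeed-core/lightspeed-evaluation.
eval_data = {
    "conversations": [
        {
            "question": "How do I register a new component in Red Hat Developer Hub?",
            "expected_answer": "Import the component's catalog-info.yaml from its Git "
                               "repository using the Catalog import flow.",
        },
        {
            "question": "How do I enable a dynamic plugin in RHDH?",
            "expected_answer": "Add the plugin package to the dynamic plugins "
                               "configuration and restart the Developer Hub instance.",
        },
    ]
}

# Write the sample data set, then read it back the way an evaluation run would.
with open("eval_data.yaml", "w") as f:
    yaml.safe_dump(eval_data, f, sort_keys=False)

with open("eval_data.yaml") as f:
    loaded = yaml.safe_load(f)

for item in loaded["conversations"]:
    print(item["question"], "->", item["expected_answer"])

      Keeping the Q&A pairs in a version-controlled YAML file would make additions easy to review and would let us tag a data set version per RHDH release.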

      Requirements (aka. Acceptance Criteria):

      Lightspeed Core has developed Lightspeed evaluation tools (https://docs.google.com/presentation/d/1FWJta5h_GRpxJrOPSaoV_BNrx64VwSXFQv5wyRe-FVA/edit?slide=id.g37dcd7f6fa3_0_449#slide=id.g37dcd7f6fa3_0_449 and https://github.com/lightspeed-core/lightspeed-evaluation) that help Lightspeed products evaluate the accuracy of their responses. Other reference: OS evaluation framework: https://docs.google.com/presentation/d/1BHbksKcnpC5LbOxoH5nUT6yozz-djoOQJd53HWDkG6U/edit?slide=id.p#slide=id.p

      This work aligns with the development of a common Lightspeed Core evaluation framework (tracked by LCORE-56), intended to provide a standardized tool for a wide range of evaluations. The evaluation should focus on application-level evaluation (AI App Eval) rather than solely benchmarking.

      We will need to:

      1. Investigate how to run the Lightspeed evaluation tools with the Lightspeed plugin
      2. Create a data set (Q&A) specific to the RHDH Lightspeed plugin that can be fed into the evaluation tools to assess the accuracy of the responses
      3. Provide recommended models with benchmark accuracy numbers for running RHDH Lightspeed, based on the results of the Lightspeed evaluation with the standard data set
        1. 1 medium/large model for running in a cluster
        2. 1 small model for running locally
      4. Provide instructions for users on how to run the evaluation tools with the standard data set
      5. Stretch: Provide instructions for users to customize the data set to help them evaluate their model in the case of BYOK and BYO MCP in the future
      6. Add model evaluation to release QE testing. The testing scenarios should include the following (a rough sketch of such a test matrix appears after this list):
        1. Lightspeed Core integration with the Gemini API: evaluation against Lightspeed Core configured to integrate with the Gemini API. While Gemini's use in products requires testing/validation, the evaluation tool supports configurable API integration.
        2. Lightspeed Core integration with open-weight models: evaluation against Lightspeed Core configured with at least two popular open-weight LLMs, such as Llama, Mistral, or Gemma. The evaluation framework supports assessing compatibility with other small language models. Open-weight models like Llama and Mistral are currently being investigated and considered for Lightspeed integration.
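
      The sketch below shows one way (under stated assumptions) that such a provider/model matrix could be wired into automated QE runs with pytest. The run_evaluation helper, the provider/model labels, and the 0.8 accuracy threshold are hypothetical placeholders introduced here, not the lightspeed-evaluation tool's actual interface or agreed targets.

import pytest

# Hypothetical provider/model matrix for release QE testing; labels are placeholders.
PROVIDER_CONFIGS = [
    {"provider": "gemini", "model": "gemini"},
    {"provider": "open-weight", "model": "llama"},
    {"provider": "open-weight", "model": "mistral"},
]


def run_evaluation(provider: str, model: str) -> float:
    """Hypothetical wrapper: configure RHDH Lightspeed / Lightspeed Core for the
    given provider and model, run the evaluation tool with the standard data set,
    and return an aggregate accuracy score in [0.0, 1.0]."""
    raise NotImplementedError("Replace with the actual evaluation invocation.")


@pytest.mark.parametrize("config", PROVIDER_CONFIGS, ids=lambda c: c["model"])
def test_model_accuracy_meets_baseline(config):
    # 0.8 is an illustrative threshold, not an agreed baseline.
    accuracy = run_evaluation(config["provider"], config["model"])
    assert accuracy >= 0.8

      Parametrizing over the provider configuration keeps the Gemini and open-weight scenarios in a single test, and new models can be added to the matrix without changing the test logic.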

      Reports & Report Retention

      A standard report format should be defined as part of this work (a sketch of one possible report record follows).
      Reports should be generated in this standard format.
      Reports should be retained between RHDH releases so that we can perform analysis over time.
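
      As a starting point for that standard, each evaluation run could emit one machine-readable record like the sketch below. All field names are suggestions introduced for illustration, not an agreed schema; the actual format would be defined as part of this work.

import json
from datetime import datetime, timezone

# Sketch of a per-run report record; every field name is illustrative.
report = {
    "rhdh_release": "1.y.z",                       # placeholder release identifier
    "plugin": "lightspeed",
    "model_under_test": "example-model",           # placeholder model identifier
    "judge_model": "example-judge-model",          # placeholder judge LLM identifier
    "data_set_version": "rhdh-lightspeed-qa-v1",   # placeholder data set tag
    "metrics": {
        "answer_accuracy": 0.0,                    # filled in by the evaluation run
        "questions_evaluated": 0,
    },
    "generated_at": datetime.now(timezone.utc).isoformat(),
}

# One JSON document per run keeps reports easy to archive, diff, and aggregate
# across RHDH releases.
print(json.dumps(report, indent=2))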

      Out of Scope (Optional)

      High-level list of items that are out of scope.

      1. Data sets for other products, e.g. RHOAI, etc.
      2. Evaluation of questions/answers that are outside the scope of the recommended usage of RHDH Lightspeed

      Customer Considerations (Optional)

      Provide any additional customer-specific considerations that must be made
      when designing and delivering the Feature. Initial completion during
      Refinement status.

      1. Expandable to allow users to evaluate models that are tailored to their organization

      Documentation Considerations

      Provide information that needs to be considered and planned so that
      documentation will meet customer needs. If the feature extends existing
      functionality, provide a link to its current documentation.

      1. How to set up and run the evaluation tools with RHDH Lightspeed
      2. How and where to download the standard data set for RHDH Lightspeed evaluation
