Type: Initiative
Resolution: Unresolved
Priority: Major
Product / Portfolio Work
Goal
Evaluate the LLM model's performance in assisting users with end-to-end OpenShift deployment on a specified platform and with OpenShift-related actions and operations.
Benefit Hypothesis (Why):
- Identify what works: Track which models and configurations perform best for specific tasks (accuracy, response time, user satisfaction).
- Hallucination Detection: Ensuring factual accuracy and minimizing the generation of false information.
- Retrieval Relevance: Verifying that our RAG system pulls the most pertinent information to ground the model's response.
- Toxicity Detection: Filtering for and eliminating harmful or inappropriate content.
- Summarization Performance: Evaluating the coherence, accuracy, and conciseness of summaries.
- Code Generation: Checking the correctness and readability of generated code, including install configs and manifests (a minimal validation sketch follows this list).
- Spot degradation: Detect when performance drops over time due to data drift or model updates.
- Catch edge cases: Document how the chat assistant handles unusual inputs, errors, or boundary conditions.
- Regulatory requirements: Many industries require documented testing for AI systems (healthcare, finance, etc.)
- Audit trails: Provide evidence of due diligence in model selection and validation.
- Risk management: Document potential failure modes and mitigation strategies.
- Maintain user trust: Provide documentation so users can make a data-driven decision about which model to choose and gain confidence before deploying the solution.
- ROI demonstration: Show Red Hat stakeholders and customers improvements in performance metrics over time.
- Resource allocation: Make informed decisions about where to invest development effort.
- Competitive advantage: Systematic testing and documentation leads to better products.
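As an illustration of what the code-generation and performance checks above could look like in practice, here is a minimal sketch in Python. The `ask_model` callable stands in for the assistant under test, the required install-config keys are a simplified subset, and the record fields are placeholders rather than a defined schema.

```python
import time
import yaml  # PyYAML, used to check that a generated install-config parses

# Simplified subset of top-level install-config.yaml keys (illustrative only).
REQUIRED_INSTALL_CONFIG_KEYS = {"apiVersion", "baseDomain", "metadata", "platform"}


def validate_install_config(text: str) -> bool:
    """Return True if the generated install-config parses and has the core keys."""
    try:
        doc = yaml.safe_load(text)
    except yaml.YAMLError:
        return False
    return isinstance(doc, dict) and REQUIRED_INSTALL_CONFIG_KEYS <= doc.keys()


def evaluate_prompt(ask_model, prompt: str) -> dict:
    """Run one prompt through the assistant and record basic quality signals."""
    start = time.monotonic()
    answer = ask_model(prompt)  # hypothetical client for the model under test
    return {
        "prompt": prompt,
        "latency_s": round(time.monotonic() - start, 3),
        "install_config_valid": validate_install_config(answer),
        # Hallucination, retrieval-relevance, and toxicity scores would be
        # filled in here by separate judges/classifiers (not shown).
    }
```

A real harness would run many such prompts per platform and persist the records over time, which is what makes the degradation and trend points above measurable.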
Resources
- Assisted MCP with a local model running on CPU
- https://github.com/IBM/ITBench (Abstract)
- k8s-bench - a benchmark, part of the kubectl-ai project, for evaluating the performance of different LLM models on Kubernetes-related tasks (a rough sketch of this task style follows this list).
- This is likely very similar in methodology to the approach RHEL Lightspeed intends to use to evaluate quality in relation to RHEL installation.
- LCORE-56 (Establish common benchmarks and practices for LLM evaluations)
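To make the k8s-bench reference above concrete, the sketch below shows the general shape of such a benchmark case: a natural-language prompt plus a cluster-side verification command. This is not k8s-bench's actual schema; the task name, command, and expected output are hypothetical.

```python
import subprocess
from dataclasses import dataclass


@dataclass
class BenchTask:
    """One benchmark case: what to ask the assistant, and how to verify it."""
    name: str
    prompt: str              # instruction given to the assistant
    verify_cmd: list[str]    # command whose output proves the task succeeded
    expected_output: str     # text the verification command must print


def verify(task: BenchTask) -> bool:
    """Run the verification command against the target cluster."""
    result = subprocess.run(task.verify_cmd, capture_output=True, text=True)
    return result.returncode == 0 and task.expected_output in result.stdout


# Hypothetical case: the assistant was asked to scale a deployment.
example = BenchTask(
    name="scale-deployment",
    prompt="Scale the 'web' deployment in namespace 'demo' to 3 replicas.",
    verify_cmd=["kubectl", "-n", "demo", "get", "deploy", "web",
                "-o", "jsonpath={.spec.replicas}"],
    expected_output="3",
)
```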
Responsibilities
Evaluation (Process Quality & Optimization) workstream – see Conversational Installation Experience for OpenShift.
Success Criteria
- A delivery pipeline that produces a concrete, repeatable measurement of the quality of the AI-assisted OpenShift installation results for the model under evaluation (a rough sketch of such a quality gate follows).
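A minimal sketch of how such a pipeline could gate on that measurement, assuming evaluation records like those sketched earlier are written to a JSON file; the file name, the `passed` field, and the 90% threshold are placeholders.

```python
import json
import sys


def quality_gate(results_path: str, min_pass_rate: float = 0.9) -> None:
    """Fail the pipeline stage if too few evaluation cases passed."""
    with open(results_path) as f:
        records = json.load(f)  # expected: a list of {"passed": bool, ...} records
    pass_rate = sum(r["passed"] for r in records) / max(len(records), 1)
    print(f"pass rate: {pass_rate:.2%} (threshold {min_pass_rate:.0%})")
    if pass_rate < min_pass_rate:
        sys.exit(1)  # non-zero exit marks the delivery-pipeline stage as failed


if __name__ == "__main__":
    quality_gate(sys.argv[1] if len(sys.argv) > 1 else "results.json")
```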
Results
Add results here once the Initiative is started. Discussions and updates are recommended once per quarter, captured as bullets.