Type: Feature
Resolution: Unresolved
Priority: Normal
Progress: 50% To Do, 50% In Progress, 0% Done
Feature Overview (mandatory - Complete while in New status)
As part of the RAG artifacts in InstructLab, we need a RAG evaluation framework to measure the performance and quality of the RAG pipelines built for this capability. To stay consistent with the research and OCTO work, we need to take RAGAS downstream as a dependency.
RAGAS is becoming the de facto standard for RAG evaluation. By productizing RAGAS within InstructLab, we can provide users with enterprise-grade evaluation capabilities while maintaining consistency with existing workflows across research, OCTO, and other internal efforts.
This card covers the work to integrate and productize the RAGAS evaluation framework in InstructLab, providing an automated, comprehensive assessment of RAG system performance. This feature will enable InstructLab to evaluate and optimize RAG implementations through standardized metrics and detailed performance insights.
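To make the intended integration concrete, the sketch below shows what a single evaluation run could look like, assuming the upstream `ragas` Python package with its `evaluate()` entry point and the metric objects exposed by its pre-0.2 releases (module paths and dataset column names differ across RAGAS versions); the sample data and question are purely illustrative.
```python
# Minimal sketch, assuming the upstream `ragas` package (pre-0.2 API);
# evaluate() calls a judge LLM under the hood, so model/credential
# configuration is omitted here.
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import (
    answer_correctness,
    answer_relevancy,
    context_precision,
    context_recall,
    faithfulness,
)

# Hypothetical sample produced by a RAG pipeline under test.
samples = {
    "question": ["What does InstructLab use for document ingestion?"],
    "answer": ["InstructLab ingests documents with the docling pipeline."],
    "contexts": [["docling converts source documents for InstructLab ingestion."]],
    "ground_truth": ["InstructLab uses docling for document ingestion."],
}

result = evaluate(
    Dataset.from_dict(samples),
    metrics=[
        answer_relevancy,
        answer_correctness,
        context_precision,
        context_recall,
        faithfulness,
    ],
)
print(result)  # aggregate score per metric
```
In the productized flow, the evaluated samples would come from the InstructLab/docling ingestion and RAG pipeline rather than a hand-written dictionary.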
Goals
- Primary use cases: internal RAG pipelines and POCs that combine RAG with a fine-tuned LLM
- Enable automated evaluation of RAG artifacts using industry-standard RAGAS metrics
- Expand existing model evaluation capabilities to include RAG-specific performance analysis
- Integrate seamlessly with current InstructLab workflows
Requirements
- Implement core RAGAS evaluation metrics:
  - Answer Relevancy scoring
  - Context Relevancy assessment
  - Faithfulness measurement
  - Answer Correctness validation
  - Context Precision/Recall analysis
- Data Processing Requirements:
  - Use the docling / InstructLab ingestion pipeline to prepare evaluation data (illustrated in the sketch after this list)
- Reports Requirements:
  - Exportable reports in Parquet or a similar format so results can be consumed outside the framework (see the sketch after this list)
- Integration Requirements:
  - Documentation and usage examples
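As a rough illustration of the data-processing and reporting requirements above, the sketch below ingests a document with docling and writes an evaluation report to Parquet. It assumes docling's `DocumentConverter` API and pandas' `to_parquet()`; the input file name and the metric values are placeholders (a real run would take the per-sample scores from the RAGAS result object, e.g. via its `to_pandas()` helper).
```python
# Minimal sketch, assuming the `docling` and pandas public APIs
# (exact class/method names may differ between releases).
import pandas as pd
from docling.document_converter import DocumentConverter

# Ingest a source document through docling, as the InstructLab pipeline does;
# the file name is a placeholder.
converter = DocumentConverter()
conversion = converter.convert("knowledge_doc.pdf")
document_text = conversion.document.export_to_markdown()  # fed to chunking/indexing

# Stand-in report frame with the shape a RAGAS result would have;
# a real run would use the evaluation result's per-sample scores instead.
report_df = pd.DataFrame(
    {
        "question": ["What does InstructLab use for document ingestion?"],
        "faithfulness": [0.92],
        "answer_relevancy": [0.88],
    }
)

# Exportable report in Parquet for consumption outside the framework.
report_df.to_parquet("rag_evaluation_report.parquet")
```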
Done:
- Core RAGAS metrics implemented and validated
- Integration tests passing with >75% coverage
- User interface components developed and tested
- Performance benchmarks established
- User documentation
- (stretch goal) Tutorials published
Questions to Answer
- What is the expected scale of concurrent evaluations?
- How will we handle version compatibility with upstream RAGAS?
- What level of customization should we allow for evaluation parameters?
- How will we store and manage evaluation results?
- What is the expected latency for real-time evaluations?
Out of Scope
- Custom LLM integration for evaluation
- Real-time streaming evaluation
- Automated optimization of RAG systems
- Integration with third-party evaluation frameworks
- Historical evaluation data migration
- Custom metric development
Customer Considerations
- Performance Impact: Minimize impact on existing system performance
- Learning Curve: Ensure a smooth learning curve for existing InstructLab users
Blocks
- RHELAI-2374 [eval] Extending RAGAS Evaluation Framework with Additional Metrics (status: New)