Issue Type: Spike
Resolution: Unresolved
Priority: Undefined
The overall mission for this work item is to run the following comparisons:
- granite-starter (InstructLab-trained) + RAG vs. llama (off-the-shelf, not trained) with no RAG, potentially also adding mistral (off-the-shelf, not trained) with no RAG.
- granite-starter (InstructLab-trained) + RAG vs. llama (off-the-shelf, not trained) + RAG vs. mistral (off-the-shelf, not trained) + RAG.
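For reference, a minimal sketch of that comparison matrix expressed as plain Python; the model names are placeholders, not the exact checkpoints or serving endpoints that will be used:

```python
# Sketch of the planned comparisons; model names are placeholders only.
comparisons = {
    "trained_plus_rag_vs_untrained_no_rag": [
        {"model": "granite-starter (InstructLab-trained)", "rag": True},
        {"model": "llama (off-the-shelf)", "rag": False},
        {"model": "mistral (off-the-shelf)", "rag": False},  # optional addition
    ],
    "trained_plus_rag_vs_untrained_plus_rag": [
        {"model": "granite-starter (InstructLab-trained)", "rag": True},
        {"model": "llama (off-the-shelf)", "rag": True},
        {"model": "mistral (off-the-shelf)", "rag": True},
    ],
}
```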
Tasks include:
- Getting access to the documents for as many POCs as possible.
- For each of them, creating a benchmark data set large enough to reliably measure distinctions like the ones requested in this work item.
- Standing up a RAG capability for conducting the tests. Note that some of the POCs are heavily focused on tables, so the capability needs to be reasonably competent at extracting answers from tables. There are conflicting examples from IBM about how to do that well using Docling, so more investigation is needed in this area (see the sketch after this list).
- Measuring how effective the models are on the benchmark data sets.
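As a starting point for the table-handling investigation, here is a minimal sketch assuming Docling's DocumentConverter API and pandas; the file path is a placeholder, and whether keeping tables as separate chunks is the right approach is exactly the open question:

```python
# A minimal sketch, assuming Docling's DocumentConverter API; "poc_document.pdf"
# is a placeholder path, not one of the actual POC documents.
from docling.document_converter import DocumentConverter

converter = DocumentConverter()
result = converter.convert("poc_document.pdf")
doc = result.document

# Keep each table intact as its own chunk (via a pandas DataFrame) rather than
# flattening it into surrounding page text, on the assumption that this helps
# the retriever answer table-focused questions.
table_chunks = []
for i, table in enumerate(doc.tables):
    df = table.export_to_dataframe()
    table_chunks.append(f"Table {i}:\n{df.to_markdown(index=False)}")

# The rest of the document goes in as ordinary markdown text chunks.
text_chunks = [c for c in doc.export_to_markdown().split("\n\n") if c.strip()]
```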
Steven asked Mo to staff this, and Mo assigned it to me. Since the request didn't come through the PMs, no PM made a Jira entry for it, so Mo told me I was welcome to create one of my own; this is that Jira entry. I'm leaving the priority undefined for now because there hasn't been any clear indication of what the priority should be.