Spike
Resolution: Unresolved
RHELAI-2530 - Subset Selection
[2771326597] Upstream Reporter: Kim
Upstream issue status: Open
Upstream description:
Feature Overview:
- When users generate a large number of samples, they will have the option to run a subset selection method to obtain a minimal set of samples that is representative of the original dataset.
- The subset selection algorithm, as developed by research, computes embeddings of the samples and then iteratively searches for a minimal subset that maximizes coverage of the dataset.
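The iterative selection described above can be sketched roughly as follows. This is an illustration only, not the research implementation: the issue does not state the coverage objective, so the sketch assumes a facility-location style score (each sample is credited with its best cosine similarity to any selected sample) and a plain greedy loop. The names `cosine` and `select_subset` are hypothetical, and the embeddings are assumed to be already computed as lists of floats.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def select_subset(embeddings, k):
    """Greedily pick up to k sample indices that maximize an assumed coverage score.

    Coverage is the sum, over all samples, of each sample's best similarity
    to any selected sample. This objective is an assumption for illustration.
    """
    n = len(embeddings)
    sim = [[cosine(embeddings[i], embeddings[j]) for j in range(n)] for i in range(n)]
    best = [0.0] * n  # best[i]: highest similarity of sample i to the subset so far
    selected = []
    for _ in range(min(k, n)):
        # Marginal gain of candidate c: how much adding it raises total coverage.
        gains = {
            c: sum(max(sim[c][i] - best[i], 0.0) for i in range(n))
            for c in range(n)
            if c not in selected
        }
        choice = max(gains, key=gains.get)
        if gains[choice] <= 0.0:
            break  # no remaining sample improves coverage
        selected.append(choice)
        best = [max(best[i], sim[choice][i]) for i in range(n)]
    return selected


# Toy data: two near-duplicate pairs pointing in different directions.
toy = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
subset = select_subset(toy, 2)
```

On the toy data, a budget of two should pick one sample from each near-duplicate pair, since a second sample from an already covered pair adds almost no coverage. The pairwise similarity matrix makes this sketch quadratic in the number of samples, so a real dataset would likely need batching or an approximate method.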
Goals (mandatory - Complete while in New status)
- Provide high-level goal statement, providing user context and expected user outcome(s) for this Feature
- An end user with a large dataset, which might otherwise require more compute, training time, and other resources, can reduce the size of the input dataset.
Requirement:
- Given an input dataset, subset selection outputs a smaller set representative of the original dataset.
Done - Acceptance Criteria:
- The output of subset selection is representative of the original dataset.
- A smoke test verifies a few use cases with subset selection (the model is being trained efficiently).
Upstream URL: https://github.com/instructlab/instructlab/issues/2857