Feature
Resolution: Done
Major
0% To Do, 0% In Progress, 100% Done
Feature Overview (mandatory - Complete while in New status)
- When users generate a large number of samples, they have the option to run subset selection to obtain a minimal set of samples that is representative of the original dataset.
- The subset selection algorithm, as developed by research, computes embeddings of the samples and then iteratively searches for a minimal subset that maximizes coverage of the dataset.
- Currently, subset selection uses an external embedding model, Snowflake Arctic (https://huggingface.co/Snowflake/snowflake-arctic-embed-l-v2.0).
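The iterative coverage search described above can be illustrated with a minimal sketch. It assumes a facility-location style coverage objective over cosine similarities, which this ticket does not confirm as the research implementation. The function name greedy_facility_location, the budget parameter, and the synthetic vectors standing in for Arctic embeddings are all hypothetical.

```python
import numpy as np


def greedy_facility_location(embeddings, budget):
    """Greedily pick up to `budget` rows that best cover all rows (hypothetical helper)."""
    # Normalize rows so the dot product is cosine similarity.
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sim = normed @ normed.T
    n = sim.shape[0]

    best_cover = np.zeros(n)           # best similarity of each sample to the subset so far
    chosen = np.zeros(n, dtype=bool)   # marks samples already selected
    selected = []

    for _ in range(min(budget, n)):
        # Marginal gain of candidate j: total improvement in coverage over all samples i.
        gains = np.maximum(sim - best_cover[:, None], 0.0).sum(axis=0)
        gains[chosen] = -np.inf        # never pick the same sample twice
        j = int(np.argmax(gains))
        selected.append(j)
        chosen[j] = True
        best_cover = np.maximum(best_cover, sim[:, j])

    return selected


# Synthetic stand-in for real embeddings: three tight clusters of ten points each.
rng = np.random.default_rng(0)
centers = np.eye(3)
points = np.vstack([c + 0.05 * rng.standard_normal((10, 3)) for c in centers])

subset = greedy_facility_location(points, budget=3)
print(sorted(i // 10 for i in subset))  # cluster id of each selected sample
```

With a budget of three, the sketch is expected to pick one sample per cluster, since a second pick from an already covered cluster adds almost no coverage. A production version would likely rely on the vetted dependencies listed under Questions to Answer rather than a dense similarity matrix.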
Goals (mandatory - Complete while in New status)
- An end user with a large dataset, which might otherwise require more compute, training time, and other resources, can reduce the size of the input dataset.
Requirements (mandatory - Complete while in Refinement status):
- Given an input dataset, subset selection outputs a smaller set representative of the original dataset.
Done - Acceptance Criteria (mandatory - Complete while in Refinement status):
- The output of subset selection is representative of the original dataset.
- A smoke test verifies a few use cases with subset selection (the model trains efficiently).
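One possible way to make "representative" checkable is a coverage score: the average, over all samples, of the highest cosine similarity to any selected sample. This is a sketch under that assumption only. The ticket does not define a metric, and the name coverage_score and the toy vectors are hypothetical.

```python
import numpy as np


def coverage_score(embeddings, subset_indices):
    """Mean over all samples of the best cosine similarity to the subset (hypothetical metric)."""
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sim_to_subset = normed @ normed[subset_indices].T
    return float(sim_to_subset.max(axis=1).mean())


# Toy vectors: samples 0 and 2 are identical, sample 1 is orthogonal to both.
toy = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]])

print(coverage_score(toy, [0]))     # subset misses the orthogonal sample
print(coverage_score(toy, [0, 1]))  # subset covers every distinct direction
```

A smoke test could compare this score for the selected subset against a random subset of the same size. Any pass threshold would need to be agreed separately.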
Use Cases - i.e. User Experience & Workflow: (Initial completion while in Refinement status):
Out of Scope (Initial completion while in Refinement status):
N/A
Documentation Considerations (Initial completion while in Refinement status):
- Document how to use subset selection for internal teams/POCs.
Questions to Answer (Initial completion while in Refinement status):
- The Arctic model needs to be included downstream as a model, that is, added to the model product inventory; a separate Jira tracks this issue.
- The current repo shared by research has no license; working with akasriva to obtain the requisite licenses.
- [Dependencies]:
1. [Ben] For subset selection, at least one of the dependencies (submodlib) uses C++ code, so just a heads up that it will take longer than usual to vet this dependency, and there is a higher risk that things get slowed down or rejected somewhere.
2. [Ben] Also faiss-gpu if we're not already packaging that. And, we'll have to figure out whether these libraries will also work with AMD and Intel or only work on Nvidia cards.
Customer Considerations (Initial completion while in Refinement status):
- The DDIS AI team (our customer 0) is trying a use case to train a model on RHSC data with ~2000 PDF files. The number of knowledge samples generated after SDG exceeds the maximum number of knowledge samples supported today. The subset selection feature will unblock such use cases by exposing a capability to map a large dataset to a
Recordings:
https://drive.google.com/file/d/1VhRLknJwEYpsDzcAQDw6WHfenN_3oVQR/view?usp=drive_link
- Sync with Shiv where he discusses integration of subset selection into recipe.yaml with a subset_method.
- incorporates
  AIPCC-453 Deliver Snowflake Arctic Embedding Model for Subset Selection (Closed)
  RHELAI-2989 [model/sdg] Support Snowflake Arctic Embedding Model for Subset Selection (Closed)
- is blocked by
  AIPCC-50 Build dependencies for subset selection (Closed)
- relates to
  RHELAI-2968 Phase I: Filter by tag for pre-computed dataset filtering (New)