Red Hat Enterprise Linux AI / RHELAI-2530

Subset Selection [dev preview]


      Feature Overview (mandatory - Complete while in New status)

      • When users generate a large number of samples, they can run subset selection to obtain a minimal set of samples that is representative of the original dataset.
      • The subset selection algorithm, as developed by research, computes embeddings of the samples and then iteratively searches for a minimal subset that maximizes coverage of the dataset.
      • Currently, subset selection uses an external embedding model, Snowflake Arctic (https://huggingface.co/Snowflake/snowflake-arctic-embed-l-v2.0).
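      The embed-then-select loop described above can be illustrated with a minimal sketch. It assumes a facility-location coverage objective over cosine similarity and plain NumPy; the ticket does not specify the objective, and the research implementation (which depends on submodlib and faiss-gpu) may differ. The function name greedy_facility_location and the budget parameter are hypothetical, and the input array stands in for embeddings that would come from the Arctic model.

```python
import numpy as np


def greedy_facility_location(embeddings: np.ndarray, budget: int) -> list[int]:
    """Greedily pick up to `budget` row indices that maximize coverage.

    Assumption: coverage is the facility-location objective, i.e. the sum
    over all samples of the similarity to their closest selected sample.
    """
    # Normalize rows so that the dot product equals cosine similarity.
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    unit = embeddings / np.clip(norms, 1e-12, None)
    sim = unit @ unit.T  # sim[i, j]: similarity of sample i to candidate j

    n = sim.shape[0]
    budget = min(budget, n)
    # best[i]: similarity of sample i to its closest already-selected sample.
    best = np.full(n, -1.0)
    chosen = np.zeros(n, dtype=bool)
    selected: list[int] = []

    for _ in range(budget):
        # Marginal gain of each candidate: total coverage if it were added,
        # minus the current coverage.
        gains = np.maximum(sim, best[:, None]).sum(axis=0) - best.sum()
        gains[chosen] = -np.inf  # never pick the same sample twice
        j = int(np.argmax(gains))
        selected.append(j)
        chosen[j] = True
        best = np.maximum(best, sim[:, j])
    return selected
```

      This version builds the full n x n similarity matrix, so it is only practical for small datasets; at the scale this feature targets, an approximate nearest-neighbour index would likely be needed instead.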

      Goals (mandatory - Complete while in New status)
      Provide high-level goal statement, providing user context and expected user outcome(s) for this Feature

      • An end user whose large dataset would otherwise require more compute, training time, and other resources can reduce the size of the input dataset.

      Requirements (mandatory - Complete while in Refinement status):

      • Given an input dataset, subset selection outputs a smaller set representative of the original dataset. 

      Done - Acceptance Criteria (mandatory - Complete while in Refinement status):

      • Output of subset selection is representative of original dataset
      • Smoke test verifying a few use cases with subset selection (the model is trained efficiently)
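      One way to make "representative" checkable in a smoke test is a coverage score. The sketch below is an assumption, not a metric defined in this ticket: it scores a subset by the mean, over all samples, of the cosine similarity to the nearest selected sample. The name coverage_score is hypothetical.

```python
import numpy as np


def coverage_score(embeddings: np.ndarray, selected: list[int]) -> float:
    """Mean cosine similarity of every sample to its nearest selected sample.

    A score near 1.0 means each sample has a close neighbour in the subset.
    This metric is an illustrative assumption, not the ticket's definition.
    """
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    unit = embeddings / np.clip(norms, 1e-12, None)
    # Rows: all samples; columns: selected samples only.
    sim_to_subset = unit @ unit[selected].T
    return float(sim_to_subset.max(axis=1).mean())
```

      A smoke test could then compare the score of the selected subset against that of a random subset of the same size and expect the selected subset to score at least as high.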

      Use Cases - i.e. User Experience & Workflow: (Initial completion while in Refinement status):

      Out of Scope (Initial completion while in Refinement status):

      N/A

      Documentation Considerations (Initial completion while in Refinement status):

      • Document how to use subset selection for internal teams/POCs.

      Questions to Answer (Initial completion while in Refinement status):

      • The Arctic model needs to be included downstream as a model, i.e. listed in the model product inventory; a separate Jira tracks this issue.
      • The current repo shared by research has no license; working with akasriva to obtain the requisite licenses.
      • [Dependencies]:
         1. [Ben] For subset selection, at least one of the dependencies (submodlib) uses C++ code, so just a heads-up that vetting this dependency will take longer than usual, with a higher risk that things get slowed down or rejected somewhere.
         2. [Ben] Also faiss-gpu, if we're not already packaging that. We'll also have to figure out whether these libraries work with AMD and Intel cards or only with Nvidia cards.
         
         

      Customer Considerations (Initial completion while in Refinement status):

      • The DDIS AI team (our customer 0) is trying a use case that trains a model on RHSC data with ~2000 PDF files. The number of knowledge samples generated after SDG exceeds the maximum number of knowledge samples supported today. The subset selection feature will unblock such use cases by exposing a capability to map a large dataset to a 

      Recordings:

      https://drive.google.com/file/d/1VhRLknJwEYpsDzcAQDw6WHfenN_3oVQR/view?usp=drive_link

      • Sync with Shiv where he discusses integration of subset selection into recipe.yaml with a subset_method. 

              rh-ee-asaluja Aditi Saluja
              rh-ee-asaluja Aditi Saluja
              Aakanksha Duggal, Jehlum Vitasta Pandit