Red Hat Enterprise Linux AI / RHELAI-3025

Phase II: Extend the concept of subset selection with filter by tag

    • Type: Feature
    • Resolution: Unresolved
    • Priority: Major
    • InstructLab - SDG

      Feature Overview (mandatory - Complete while in New status)

      Extending the idea of subset selection, we want to allow users to drop data samples tagged with certain keys (for example, summarization) from their input datasets, SDG-generated datasets, pre-computed datasets, or any other datasets.

      This can serve as a precursor to, or be used in combination with, data mixing and/or subset selection.

      Goals (mandatory - Complete while in New status)
      Provide high-level goal statement, providing user context and expected user outcome(s) for this Feature

      • Allow users to customize their dataset(s) by keeping the samples for the topics they consider to be the most useful

      Requirements (mandatory - Complete while in Refinement status):

      • Given an input dataset and a tag, output a dataset without the samples mapped to that tag.
      • Ensure that tags and metadata are used consistently for filtering data: add filters for source, topics, etc. (understand the RH AI team's existing capabilities for topics and include those and more)
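
      The first requirement can be illustrated with a minimal Python sketch. It is not an existing InstructLab or SDG API: the function name drop_samples_with_tag, the "tags" field, and the sample layout (a dict whose tags are stored as a list of strings) are all assumptions made for illustration.

```python
from typing import Any, Iterable

# Assumption: each sample is a dict, and its tags (if any) live in a
# list of strings under a "tags" key. The real schema is still an open question.
Sample = dict[str, Any]


def drop_samples_with_tag(
    samples: Iterable[Sample], tag: str, tag_field: str = "tags"
) -> list[Sample]:
    """Return the samples that are NOT mapped to `tag` (hypothetical helper)."""
    return [s for s in samples if tag not in s.get(tag_field, [])]


dataset = [
    {"id": 1, "text": "Summarize this article.", "tags": ["summarization"]},
    {"id": 2, "text": "Translate this sentence.", "tags": ["translation"]},
    {"id": 3, "text": "A sample with no tags."},
]

filtered = drop_samples_with_tag(dataset, "summarization")
print([s["id"] for s in filtered])  # prints [2, 3]
```

      In this sketch, untagged samples are kept, which seems the safest default for external datasets that have not been tagged yet.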

      Done - Acceptance Criteria (mandatory - Complete while in Refinement status):

      • Expose this through CLI and SDK
      • This can be invoked at any time in the pipeline - pre and post-SDG
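
      One way to picture the CLI criterion is a small standalone script that works on any JSONL file, so it can run before or after SDG. This is only a sketch: the flag names (--input, --output, --drop-tag), the JSONL layout, and the "tags" field are hypothetical and are not options of any existing command.

```python
import argparse
import json
import tempfile
from pathlib import Path


def build_parser() -> argparse.ArgumentParser:
    # All flag names below are hypothetical, chosen only for this sketch.
    parser = argparse.ArgumentParser(description="Drop samples by tag from a JSONL dataset")
    parser.add_argument("--input", required=True, help="path to the input JSONL file")
    parser.add_argument("--output", required=True, help="path to the filtered JSONL file")
    parser.add_argument("--drop-tag", action="append", required=True,
                        help="tag to drop; may be given more than once")
    return parser


def main(argv: list[str]) -> int:
    """Filter the input file and return the number of samples kept."""
    args = build_parser().parse_args(argv)
    kept = 0
    with open(args.input) as src, open(args.output, "w") as dst:
        for line in src:
            if not line.strip():
                continue  # skip blank lines
            sample = json.loads(line)
            # Assumption: tags are stored as a list under "tags".
            tags = sample.get("tags", [])
            if any(tag in tags for tag in args.drop_tag):
                continue  # sample is mapped to a dropped tag
            dst.write(json.dumps(sample) + "\n")
            kept += 1
    return kept


# Demonstration on a temporary file, so the sketch runs end to end.
with tempfile.TemporaryDirectory() as tmp:
    src_path = Path(tmp) / "in.jsonl"
    out_path = Path(tmp) / "out.jsonl"
    src_path.write_text(
        json.dumps({"id": 1, "tags": ["summarization"]}) + "\n"
        + json.dumps({"id": 2, "tags": ["translation"]}) + "\n"
    )
    kept = main(["--input", str(src_path), "--output", str(out_path),
                 "--drop-tag", "summarization"])
    remaining = [json.loads(l) for l in out_path.read_text().splitlines()]

print(kept, remaining)
```

      Because the sketch depends only on the file format, the same invocation would apply to an input dataset, an SDG-generated dataset, or a pre-computed dataset.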

      Use Cases - i.e. User Experience & Workflow: (Initial completion while in Refinement status):

      Consider labels and selectors in Kubernetes as a reference model for tagging and filtering.
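
      To make the Kubernetes analogy concrete, here is a minimal sketch of equality-based selectors applied to dataset samples. It mimics the Kubernetes rule that key=value requires a matching label, while key!=value also matches objects that lack the key. The function names, the "labels" field, and the label keys (topic, source) are assumptions for illustration only.

```python
def parse_selector(selector: str) -> list[tuple[str, str, str]]:
    """Parse 'k1=v1,k2!=v2' into (key, operator, value) requirements."""
    requirements = []
    for term in selector.split(","):
        term = term.strip()
        if not term:
            continue
        if "!=" in term:
            key, value = term.split("!=", 1)
            op = "!="
        elif "=" in term:
            key, value = term.split("=", 1)
            value = value.lstrip("=")  # accept '==' as a synonym for '='
            op = "="
        else:
            raise ValueError(f"unsupported selector term: {term!r}")
        requirements.append((key.strip(), op, value.strip()))
    return requirements


def matches(labels: dict[str, str], requirements: list[tuple[str, str, str]]) -> bool:
    """True if the labels satisfy every requirement (terms are ANDed)."""
    for key, op, value in requirements:
        actual = labels.get(key)
        if op == "=" and actual != value:
            return False
        if op == "!=" and actual == value:  # a missing key satisfies '!='
            return False
    return True


# Assumption: each sample carries a flat "labels" mapping.
samples = [
    {"id": 1, "labels": {"topic": "summarization", "source": "sdg"}},
    {"id": 2, "labels": {"topic": "summarization", "source": "precomputed"}},
    {"id": 3, "labels": {"topic": "qa", "source": "sdg"}},
]

# Drop summarization samples unless they come from the pre-computed dataset.
reqs = parse_selector("topic=summarization,source!=precomputed")
kept_ids = [s["id"] for s in samples if not matches(s["labels"], reqs)]
print(kept_ids)  # prints [2, 3]
```

      Set-based selectors (in, notin, exists) are left out of this sketch; they would be a natural extension if filters on source and topics need to combine several values.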

      Out of Scope (Initial completion while in Refinement status):

      N/A

      Documentation Considerations (Initial completion while in Refinement status):
      Provide information that needs to be considered and planned so that documentation will meet customer needs. If the feature extends existing functionality, provide a link to its current documentation.

      Questions to Answer (Initial completion while in Refinement status):

      1. What tags and metadata are we using that will help with dataset manipulation?
      2. How do we evolve ingestion, pre-computed datasets, and SDG to ensure our datasets are tagged?
      3. How do we make this a part of schema validation so that external datasets are also checked against these tags?
      4. Do we want to document this information so that users know how to use these tags and when to use each one?

      Background and Strategic Fit (Initial completion while in Refinement status):

      Provide any additional context needed to frame the feature.

      • The original use case for filter-by-tag is part of recipes.yaml for data mixing. Users who want to train the model on custom skills can add a tag of their choice to recipes.yaml to indicate that samples with that tag should be dropped from the pre-computed dataset, which ensures their custom skills are picked up.

      Customer Considerations (Initial completion while in Refinement status):
      Provide any additional customer-specific considerations that must be made when designing and delivering the Feature.

              jepandit@redhat.com Jehlum Vitasta Pandit
              rh-ee-asaluja Aditi Saluja
              Aditi Saluja, Jehlum Vitasta Pandit