Feature
Resolution: Unresolved
Major
Feature Overview (mandatory - Complete while in New status)
Extending the idea of subset selection, we want to allow users to drop data samples tagged with certain keys (for example, summarization) from their input datasets, SDG-generated datasets, pre-computed datasets, or any other datasets.
This can serve as a precursor to, or be used in combination with, data mixing and/or subset selection.
Goals (mandatory - Complete while in New status)
Provide high-level goal statement, providing user context and expected user outcome(s) for this Feature
- Allow users to customize their dataset(s) by keeping the samples for the topics they consider to be the most useful
Requirements (mandatory - Complete while in Refinement status):
- Given an input dataset and a tag, output a dataset without the samples mapped to that tag.
- Ensure tags and metadata are used consistently for filtering data: add filters for source, topics, etc. (understand the RH AI team's existing capabilities for topics and include those and more)
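The first requirement can be sketched as follows. This is a minimal illustration, not the actual implementation: the function name drop_tagged, the "tags" field name, and the list-of-dicts sample layout are all assumptions made for the example.

```python
from typing import Any, Dict, Iterable, List

Sample = Dict[str, Any]  # hypothetical sample layout: one dict per record


def drop_tagged(
    samples: Iterable[Sample],
    tags_to_drop: Iterable[str],
    tag_field: str = "tags",  # assumed field name, not a confirmed schema
) -> List[Sample]:
    """Return the samples that carry none of the tags in tags_to_drop."""
    drop = set(tags_to_drop)
    kept = []
    for sample in samples:
        tags = sample.get(tag_field) or []
        if isinstance(tags, str):
            # Accept a single tag stored as a plain string.
            tags = [tags]
        if drop.isdisjoint(tags):
            kept.append(sample)
    return kept


dataset = [
    {"text": "Summarize this article.", "tags": ["summarization"]},
    {"text": "What is the capital of France?", "tags": ["qa"]},
    {"text": "Untagged sample."},
]
filtered = drop_tagged(dataset, ["summarization"])
```

Untagged samples are kept in this sketch; whether untagged data should be kept or rejected is an open design choice tied to the schema validation question below.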
Done - Acceptance Criteria (mandatory - Complete while in Refinement status):
- Expose this through CLI and SDK
- This can be invoked at any point in the pipeline, both pre-SDG and post-SDG
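One possible shape for the CLI and SDK surface is sketched below, assuming JSONL files with one sample per line. The flag names (--input, --output, --drop-tag, --tag-field) and the function names are hypothetical and do not reflect any existing command.

```python
import argparse
import json
from typing import Any, Dict, Iterable, List, Optional


def drop_tagged(
    samples: Iterable[Dict[str, Any]],
    tags_to_drop: Iterable[str],
    tag_field: str = "tags",
) -> List[Dict[str, Any]]:
    """SDK entry point (hypothetical): keep samples with none of the given tags."""
    drop = set(tags_to_drop)
    return [s for s in samples if drop.isdisjoint(s.get(tag_field) or [])]


def main(argv: Optional[List[str]] = None) -> int:
    """CLI entry point (hypothetical): filter a JSONL file, return the drop count."""
    parser = argparse.ArgumentParser(description="Drop samples by tag.")
    parser.add_argument("--input", required=True, help="input JSONL file")
    parser.add_argument("--output", required=True, help="output JSONL file")
    parser.add_argument(
        "--drop-tag",
        dest="drop_tags",
        action="append",
        required=True,
        help="tag to drop; repeat the flag to drop several tags",
    )
    parser.add_argument("--tag-field", default="tags", help="field holding the tags")
    args = parser.parse_args(argv)

    # Read one JSON object per non-empty line.
    with open(args.input, encoding="utf-8") as f:
        samples = [json.loads(line) for line in f if line.strip()]

    kept = drop_tagged(samples, args.drop_tags, args.tag_field)

    with open(args.output, "w", encoding="utf-8") as f:
        for sample in kept:
            f.write(json.dumps(sample) + "\n")

    return len(samples) - len(kept)
```

Because the CLI only wraps the SDK function, the same filter can run on a file before SDG or on SDG output afterwards, for example main(["--input", "in.jsonl", "--output", "out.jsonl", "--drop-tag", "summarization"]).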
Use Cases - i.e. User Experience & Workflow: (Initial completion while in Refinement status):
Consider labels and selectors in Kubernetes as a reference model.
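To make the Kubernetes analogy concrete, the sketch below applies selector-style matching to dataset samples. The match_labels and match_expressions arguments mirror the Kubernetes matchLabels and matchExpressions concepts with the In, NotIn, Exists, and DoesNotExist operators. The "labels" field on each sample and the function names are assumptions for illustration only.

```python
from typing import Any, Dict, Iterable, List, Optional

Labels = Dict[str, str]


def matches(
    labels: Labels,
    match_labels: Optional[Labels] = None,
    match_expressions: Optional[List[Dict[str, Any]]] = None,
) -> bool:
    """Return True if labels satisfy every selector requirement (logical AND)."""
    for key, value in (match_labels or {}).items():
        if labels.get(key) != value:
            return False
    for expr in match_expressions or []:
        key, op = expr["key"], expr["operator"]
        values = set(expr.get("values", []))
        if op == "In":
            if key not in labels or labels[key] not in values:
                return False
        elif op == "NotIn":
            # As in Kubernetes, a missing key satisfies NotIn.
            if key in labels and labels[key] in values:
                return False
        elif op == "Exists":
            if key not in labels:
                return False
        elif op == "DoesNotExist":
            if key in labels:
                return False
        else:
            raise ValueError(f"unsupported operator: {op}")
    return True


def drop_matching(
    samples: Iterable[Dict[str, Any]],
    match_labels: Optional[Labels] = None,
    match_expressions: Optional[List[Dict[str, Any]]] = None,
) -> List[Dict[str, Any]]:
    """Keep only the samples whose labels do NOT match the selector."""
    return [
        s
        for s in samples
        if not matches(s.get("labels") or {}, match_labels, match_expressions)
    ]


samples = [
    {"id": 1, "labels": {"source": "sdg", "topic": "summarization"}},
    {"id": 2, "labels": {"source": "user", "topic": "qa"}},
    {"id": 3, "labels": {}},
]
selector = [{"key": "topic", "operator": "In", "values": ["summarization"]}]
remaining = drop_matching(samples, match_expressions=selector)
```

Key-value labels, rather than flat tags, would let a single selector combine source and topic filters, which lines up with the requirement to filter on both.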
Out of Scope (Initial completion while in Refinement status):
N/A
Documentation Considerations (Initial completion while in Refinement status):
Provide information that needs to be considered and planned so that documentation will meet customer needs. If the feature extends existing functionality, provide a link to its current documentation.
Questions to Answer (Initial completion while in Refinement status):
- What tags and metadata are we using that will help with dataset manipulation?
- How do we evolve ingestion, pre-computed datasets and SDG to ensure our datasets are tagged?
- How do we make this a part of schema validation so that external datasets are also checked for these tags?
- Do we want to document this information to ensure users know how to use these tags and when to use which?
Background and Strategic Fit (Initial completion while in Refinement status):
Provide any additional context needed to frame the feature.
- The original use case for filter-by-tag comes from recipes.yaml for data mixing. Users who want to train the model on custom skills can add a tag of their choice in recipes.yaml to indicate that samples carrying that tag should be dropped from the pre-computed dataset, so that their custom skills are picked up.
Customer Considerations (Initial completion while in Refinement status):
Provide any additional customer-specific considerations that must be made when designing and delivering the Feature.
- is related to RHELAI-2530 Subset Selection [dev preview] (Closed)