Type: Feature
Resolution: Unresolved
Priority: Major
Fix Version/s: None
Affects Version/s: None
Component/s: InstructLab - SDG
Labels:
- 1.6-candidate

Blocked:
False
Blocked Reason:
None
Ready:
False
Color Status:
Not Selected

SFDC Cases Links:
SFDC Cases Open:
SFDC Cases Counter:

Intelligence Requested:
Market:

Feature Overview (mandatory - Complete while in New status)

Extending the idea of subset selection, we want to let users drop data samples tagged with certain keys (for example, summarization) from their input datasets, SDG-generated datasets, pre-computed datasets, or any other dataset.

This can serve as a precursor to, or be used in combination with, data mixing and/or subset selection.

Goals (mandatory - Complete while in New status)
Provide high-level goal statement, providing user context and expected user outcome(s) for this Feature

Allow users to customize their dataset(s) by keeping only the samples for the topics they consider most useful.

Requirements (mandatory - Complete while in Refinement status):

Given an input dataset and a tag, the feature outputs a dataset without the samples mapped to that tag.
Ensure tags and metadata are used consistently for filtering data: add filters for source, topics, etc. (understand the RH AI team's existing capabilities for topics and include those and more).
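
To make the first requirement concrete, here is a minimal sketch of the drop-by-tag behavior. It assumes each sample is a dict that carries its tags in a list under a `tags` key; that field name, the sample layout, and the function name `drop_by_tag` are illustrative assumptions, not the existing SDG schema or API.

```python
def drop_by_tag(samples, tag, tags_field="tags"):
    """Return the samples that are NOT mapped to `tag`.

    Assumption: each sample is a dict whose `tags_field` entry is a list
    of tag strings. Samples with no tags at all are always kept.
    """
    return [s for s in samples if tag not in (s.get(tags_field) or [])]


# Hypothetical input dataset, for illustration only.
samples = [
    {"id": 1, "text": "Summarize the following report.", "tags": ["summarization"]},
    {"id": 2, "text": "Extract the invoice total.", "tags": ["extraction", "finance"]},
    {"id": 3, "text": "Untagged sample.", "tags": []},
]

kept = drop_by_tag(samples, "summarization")
print([s["id"] for s in kept])
```

Keeping untagged samples is a design choice in this sketch; whether untagged data should pass through or be rejected is one of the open questions listed further down.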

Done - Acceptance Criteria (mandatory - Complete while in Refinement status):

Expose this through CLI and SDK
This can be invoked at any time in the pipeline - pre and post-SDG
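
One possible shape for the CLI and SDK split is sketched below: a small library function (the SDK side) wrapped by a thin command-line entry point. It is a standalone illustration, not the `ilab` CLI. The flag names `--input`, `--output`, and `--drop-tag`, the JSONL layout, and the `tags` field are all assumptions. Because it only reads and writes a dataset file, the same command could be run on data before or after SDG.

```python
import argparse
import json


def filter_lines(lines, drop_tags, tags_field="tags"):
    """SDK side: yield JSONL lines whose sample carries none of `drop_tags`."""
    drop = set(drop_tags)
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        sample = json.loads(line)
        if drop & set(sample.get(tags_field) or []):
            continue  # sample is mapped to a dropped tag
        yield json.dumps(sample) + "\n"


def build_parser():
    """CLI side: hypothetical flags, chosen for this sketch only."""
    parser = argparse.ArgumentParser(description="Drop samples by tag from a JSONL dataset.")
    parser.add_argument("--input", required=True, help="path to the input JSONL file")
    parser.add_argument("--output", required=True, help="path to the filtered JSONL file")
    parser.add_argument("--drop-tag", action="append", required=True,
                        help="tag to drop; repeat the flag to drop several tags")
    return parser


def main(argv=None):
    args = build_parser().parse_args(argv)
    with open(args.input, encoding="utf-8") as src, \
            open(args.output, "w", encoding="utf-8") as dst:
        dst.writelines(filter_lines(src, args.drop_tag))
    return 0
```

Calling `main(["--input", "in.jsonl", "--output", "out.jsonl", "--drop-tag", "summarization"])` would write the filtered dataset to `out.jsonl`; the file names here are placeholders.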

Use Cases - i.e. User Experience & Workflow: (Initial completion while in Refinement status):

Consider Labels and selectors in Kubernetes
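
As a sketch of what a Kubernetes-style model could look like for datasets: each sample carries key/value labels, and a selector is a list of requirements using the same four operators as Kubernetes `matchExpressions` (`In`, `NotIn`, `Exists`, `DoesNotExist`). All requirements must hold for a sample to match, and matching samples are dropped. The label keys `topic` and `source` and the sample layout are assumptions made for illustration.

```python
def matches(labels, requirements):
    """Return True if `labels` satisfies every (key, operator, values) requirement."""
    for key, op, values in requirements:
        present = key in labels
        if op == "In":
            ok = present and labels[key] in values
        elif op == "NotIn":
            # As in Kubernetes, a missing key also satisfies NotIn.
            ok = not present or labels[key] not in values
        elif op == "Exists":
            ok = present
        elif op == "DoesNotExist":
            ok = not present
        else:
            raise ValueError(f"unknown operator: {op}")
        if not ok:
            return False
    return True


# Hypothetical labelled samples.
samples = [
    {"id": 1, "labels": {"topic": "summarization", "source": "precomputed"}},
    {"id": 2, "labels": {"topic": "summarization", "source": "user"}},
    {"id": 3, "labels": {"topic": "finance", "source": "precomputed"}},
    {"id": 4, "labels": {}},
]

# Drop summarization samples, but only those from the pre-computed dataset.
selector = [
    ("topic", "In", {"summarization"}),
    ("source", "In", {"precomputed"}),
]
kept = [s for s in samples if not matches(s["labels"], selector)]
print([s["id"] for s in kept])
```

Compared with a flat tag list, key/value labels make it possible to combine filters on source and topic in a single selector.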

Out of Scope (Initial completion while in Refinement status):

N/A

Documentation Considerations (Initial completion while in Refinement status):
Provide information that needs to be considered and planned so that documentation will meet customer needs. If the feature extends existing functionality, provide a link to its current documentation.
<your text here>

Questions to Answer (Initial completion while in Refinement status):

What tags and metadata are we using that will help with dataset manipulation?
How do we evolve ingestion, pre-computed datasets, and SDG to ensure our datasets are tagged?
How do we make this part of schema validation so that external datasets are also checked for these tags?
Do we want to document this information to ensure users know how to use these tags and when to use which?

Background and Strategic Fit (Initial completion while in Refinement status):

Provide any additional context needed to frame the feature.

The original use case for filter-by-tag is part of recipes.yaml for data mixing. It lets users who want to train the model on custom skills add a tag of their choice to recipes.yaml, indicating that samples with that tag should be dropped from the pre-computed dataset so that their custom skills are picked up.

Customer Considerations (Initial completion while in Refinement status):
Provide any additional customer-specific considerations that must be made when designing and delivering the Feature.
<your text here>

is related to

RHELAI-2530 Subset Selection [dev preview]

Closed

Assignee: Jehlum Vitasta Pandit

Reporter: Aditi Saluja

Contributors: Aditi Saluja, Jehlum Vitasta Pandit

Votes: 0

Watchers: 1

Created: 2025/01/13 7:46 PM

Updated: 2025/03/19 3:49 AM
