Type: Feature
Resolution: Unresolved
Priority: Major
Fix Version/s: None
Affects Version/s: None
Component/s: InstructLab - SDG, InstructLab - Training
Labels:
- 1.4-candidate

Blocked:
False
Blocked Reason:
None
Ready:
False
Color Status:
Not Selected

Feature

This work refactors the InstructLab training pipeline to support token masking for pre-trained data. The feature will enable the pipeline to consume the generated metadata for masking tokens, in line with the new SDG standard message format. The new SDG process delegates the token masking task to the training pipeline, so masking is applied at training time instead of being baked into the generated dataset.

Goals

Enable the training pipeline to understand the SDG metadata for token masking for pre-trained data.
Support and follow the SDG metadata for token masking according to the new SDG standard message format.
Expand the current pre-trained data flow to support the new SDG metadata for token masking.

Requirements

The training pipeline must be able to consume SDG-generated metadata for token masking.
The pipeline should adhere to the new SDG standard message format.
The token masking process should not disrupt the existing training workflow.
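
One way to read the requirements above is that a sample without masking metadata must keep its current behavior. The sketch below illustrates that idea only. The `metadata` field and its `unmask` key are hypothetical names chosen for illustration; the ticket does not specify the actual layout of the SDG standard message format.

```python
from dataclasses import dataclass
from typing import Any


@dataclass(frozen=True)
class MaskingPolicy:
    # True: compute loss on every token of the sample.
    # False: keep the default masking behavior.
    unmask_all: bool


def read_masking_policy(sample: dict[str, Any]) -> MaskingPolicy:
    """Read the masking policy from a sample's metadata.

    ASSUMPTION: the "metadata" field and its "unmask" key are hypothetical
    names, not the confirmed SDG message format.
    """
    metadata = sample.get("metadata") or {}
    # Only an explicit boolean True enables unmasking. A missing or
    # malformed value falls back to the default, so samples without
    # metadata are handled exactly as before.
    return MaskingPolicy(unmask_all=metadata.get("unmask") is True)
```

Defaulting to the masked behavior means datasets produced before the format change would pass through unchanged, which is one possible way to meet the "should not disrupt the existing training workflow" requirement.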

Background

The current pre-trained data flow assumes that SDG has already applied token masking to the generated dataset, which means information can be lost for the steps that follow. The new SDG standard message format generates metadata for token masking but does not modify the actual pairs, so the training pipeline must be refactored to understand the metadata and apply the masking itself.
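
To make the Background concrete, here is a minimal sketch of applying masking at training time while leaving the pairs untouched. It is not the InstructLab implementation: the `build_labels` helper, the role names, and the per-message token lists are assumptions made for illustration, and toy integer token IDs stand in for a real tokenizer.

```python
# Label value that PyTorch's CrossEntropyLoss ignores by default.
IGNORE_INDEX = -100


def build_labels(
    segments: list[tuple[str, list[int]]],
    unmask: bool,
) -> tuple[list[int], list[int]]:
    """Build (input_ids, labels) from per-message token IDs.

    segments: (role, token_ids) pairs in conversation order (hypothetical layout).
    unmask:   hypothetical metadata flag; True keeps every token in the loss,
              False keeps only assistant tokens.
    """
    input_ids: list[int] = []
    labels: list[int] = []
    for role, ids in segments:
        # The input tokens are never altered; only the labels differ.
        input_ids.extend(ids)
        if unmask or role == "assistant":
            labels.extend(ids)
        else:
            labels.extend([IGNORE_INDEX] * len(ids))
    return input_ids, labels
```

Because the masking decision lives in the labels, the same stored sample can be trained with or without masking, which is the information the current flow loses when masking is applied during generation.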

Done

[ ] The training pipeline supports token masking for pre-trained data.
[ ] The pipeline follows SDG-generated metadata for token masking according to the new SDG standard message format.
[ ] The token masking process does not disrupt the existing training workflow.

Out of Scope

[ ] Updating external 3rd-party integrations, which consume intermediate artifacts

Customer Considerations:

The training pipeline should maintain its efficiency and not introduce any significant delays in the data preparation process.

Assignee: William Caban

Reporter: William Caban

Contributors: Mustafa Eyceoz

Votes: 0

Watchers: 3

Created: 2024/11/27 4:45 PM

Updated: 2024/12/13 7:24 AM
