-
Feature
-
Resolution: Unresolved
-
Major
-
None
-
None
-
False
-
-
False
-
Not Selected
Feature
This work is to refactor the InstructLab training pipeline to support token masking for pre-trained data. This feature will enable us to consume the generate metadata for masking tokens, aligning with the new SDG standard message format. The new SDG process delegates the token masking task to the training pipeline, enhancing its functionality and efficiency.
Goals
- Enable the training pipeline to understand the SDG metadata for token masking for pre-trained data.
- Support and follow the SDG metadata for token masking according to the new SDG standard message format.
- The current pre-trained data flow will be expanded to support the new SDG metadata for token masking.
Requirements
- The training pipeline must be able to consume SDG-generated metadata for token masking.
- The pipeline should adhere to the new SDG standard message format.
- The token masking process should not disrupt the existing training workflow.
Background
The current pre-trained data flow assumes SDG generated the dataset with the token masking, which means information could be lost for the following steps. The new SDG standard message format generates metadata for token masking but do not modify the actual pairs, necessitating a refactored training pipeline to understand the metadata and apply the masking.
Done
- [ ] The training pipeline supports token masking for pre-trained data.
- [ ] The pipeline follows SDG-generates metadata for token masking according to the new SDG standard message format.
- [ ] The token masking process does not disrupt the existing training workflow.
Out of Scope
- [ ] Updating external 3rd-party integrations, which consume intermediate artifacts
Customer Considerations:
- The training pipeline should maintain its efficiency and not introduce any significant delays in the data preparation process.