Uploaded image for project: 'Red Hat Enterprise Linux AI'
  1. Red Hat Enterprise Linux AI
  2. RHELAI-2420

Refactor the training pipeline to support token masking for pre-trained data

XMLWordPrintable

    • False
    • Hide

      None

      Show
      None
    • False
    • Not Selected

      Feature

      This work is to refactor the InstructLab training pipeline to support token masking for pre-trained data. This feature will enable us to consume the generate metadata for masking tokens, aligning with the new SDG standard message format. The new SDG process delegates the token masking task to the training pipeline, enhancing its functionality and efficiency.

      Goals

      • Enable the training pipeline to understand the SDG metadata for token masking for pre-trained data.
      • Support and follow the SDG metadata for token masking according to the new SDG standard message format.
      • The current pre-trained data flow will be expanded to support the new SDG metadata for token masking.

      Requirements

      • The training pipeline must be able to consume SDG-generated metadata for token masking.
      • The pipeline should adhere to the new SDG standard message format.
      • The token masking process should not disrupt the existing training workflow.

      Background

      The current pre-trained data flow assumes SDG generated the dataset with the token masking, which means information could be lost for the following steps. The new SDG standard message format generates metadata for token masking but do not modify the actual pairs, necessitating a refactored training pipeline to understand the metadata and apply the masking.

      Done

      • [ ] The training pipeline supports token masking for pre-trained data.
      • [ ] The pipeline follows SDG-generates metadata for token masking according to the new SDG standard message format.
      • [ ] The token masking process does not disrupt the existing training workflow.

      Out of Scope

      • [ ] Updating external 3rd-party integrations, which consume intermediate artifacts

      Customer Considerations:

      • The training pipeline should maintain its efficiency and not introduce any significant delays in the data preparation process.

              wcabanba@redhat.com William Caban
              wcabanba@redhat.com William Caban
              Mustafa Eyceoz
              Votes:
              0 Vote for this issue
              Watchers:
              3 Start watching this issue

                Created:
                Updated: