RHELAI-4007

Automated train run benchmarking GitHub action


    • Sprint 1

      When evaluating PRs in the training repo, we often want to understand how the changes affect the overall performance of the training library under a given configuration. As a motivating example, consider the ongoing changes to the multipack sampling code. We wish to understand how this change affects the overall runtime of training, while confirming that it doesn't harm model convergence.

      In order to do so, the current process requires the developer to spin up an EC2 instance with enough GPUs, set up their environment, run a training job on the new branch with sufficient logging, switch to the main branch, rerun the training job, and visualize the outputs so the two runs can be compared easily.

      The proposal is to simplify this process by creating a GitHub workflow (sketched after the list below) that:

      • Accepts a string of hyperparameter arguments as input to workflow_dispatch
      • Launches a 4xL40S EC2 instance (similar to the existing e2e job)
      • Sets up the local environment
      • Checks out the selected branch
      • Runs something like `torchrun main_ds.py --logging ON ${{ user_hyper_params }}`
      • Optionally runs the same command on the main branch
      • Uploads the required artifacts for comparison (e.g. TensorBoard data), or logs to wandb if available
      • Links to the relevant artifacts or wandb runs in the PR (if one exists)
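
      As a minimal sketch, the workflow file could look something like the following. The runner label, input names, the compare_to_main toggle, the pip-based environment setup, and the artifact path are all assumptions for illustration; only the torchrun invocation comes from the proposal above.

```yaml
name: Benchmark train run

on:
  workflow_dispatch:
    inputs:
      hyper_params:
        description: "Hyperparameter arguments appended to the torchrun command"
        required: false
        default: ""
      compare_to_main:
        description: "Also rerun the same command on the main branch"
        type: boolean
        default: false

jobs:
  benchmark:
    # Assumed label for a self-hosted 4xL40S EC2 runner, provisioned the
    # same way as the existing e2e job.
    runs-on: [self-hosted, l40s-4x]
    steps:
      - name: Check out the selected branch
        uses: actions/checkout@v4
        with:
          fetch-depth: 0   # full history, so main can be checked out later

      - name: Set up the local environment
        run: |
          python -m venv venv
          . venv/bin/activate
          pip install .   # assumes the repo is pip-installable

      - name: Train on the selected branch
        run: |
          . venv/bin/activate
          torchrun main_ds.py --logging ON ${{ inputs.hyper_params }}

      - name: Train on main for comparison (optional)
        if: ${{ inputs.compare_to_main }}
        run: |
          git checkout main
          . venv/bin/activate
          torchrun main_ds.py --logging ON ${{ inputs.hyper_params }}

      - name: Upload comparison artifacts
        uses: actions/upload-artifact@v4
        with:
          name: benchmark-artifacts
          path: runs/   # placeholder for wherever TensorBoard data lands
```

      Assuming the file is saved as .github/workflows/benchmark.yml, a run could then be kicked off with something like `gh workflow run benchmark.yml --ref my-feature-branch -f hyper_params="--num-epochs 1"` (the hyperparameter flags here are illustrative).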

      This would reduce the developer time required to set up testing pipelines for each change and run them manually. It would also add consistency to the testing process for this kind of common comparison.

              Assignee: Fynn Schmitt-Ulms (rh-ee-fschmitt)
              Reporter: Fynn Schmitt-Ulms (rh-ee-fschmitt)