RHELAI-4007

Automated train run benchmarking GitHub action


    • Sprint 1

      When evaluating PRs in the training repo, we often want to understand how the changes affect the overall performance of the training library under a given configuration. As a motivating example, consider the ongoing changes to the multipack sampling code. We wish to understand how this change affects the overall runtime of training, while confirming that it doesn't harm model convergence.

      In order to do so, the current process requires the developer to spin up an EC2 instance with enough GPUs, set up their environment, run a training job on the new branch with sufficient logging, switch to the main branch, rerun the training job, and visualize the outputs so the two runs can be compared easily.

      The proposal is to simplify this process by creating a GitHub workflow (sketched after the list below) that:

      • Accepts a string of hyperparameter arguments as input to workflow_dispatch
      • Launches a 4xL40S EC2 instance (similar to the existing e2e job)
      • Sets up the local environment
      • Checks out the selected branch
      • Runs something like `torchrun main_ds.py --logging ON ${{ user_hyper_params }}`
      • Optionally runs the same command on the main branch
      • Uploads the required artifacts for comparison (e.g. TensorBoard data), or logs to wandb if available
      • Links to the relevant artifacts or wandb runs in the PR (if one exists)
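
      As a minimal sketch, the workflow file could look something like the following. The runner label, input names, the compare_to_main toggle, the pip-based environment setup, and the artifact path are all assumptions for illustration; only the torchrun invocation comes from the proposal above.

```yaml
name: Benchmark train run

on:
  workflow_dispatch:
    inputs:
      hyper_params:
        description: "Hyperparameter arguments appended to the torchrun command"
        required: false
        default: ""
      compare_to_main:
        description: "Also rerun the same command on the main branch"
        type: boolean
        default: false

jobs:
  benchmark:
    # Assumed label for a self-hosted 4xL40S EC2 runner, provisioned the
    # same way as the existing e2e job.
    runs-on: [self-hosted, l40s-4x]
    steps:
      - name: Check out the selected branch
        uses: actions/checkout@v4
        with:
          fetch-depth: 0   # full history, so main can be checked out later

      - name: Set up the local environment
        run: |
          python -m venv venv
          . venv/bin/activate
          pip install .   # assumes the repo is pip-installable

      - name: Train on the selected branch
        run: |
          . venv/bin/activate
          torchrun main_ds.py --logging ON ${{ inputs.hyper_params }}

      - name: Train on main for comparison (optional)
        if: ${{ inputs.compare_to_main }}
        run: |
          git checkout main
          . venv/bin/activate
          torchrun main_ds.py --logging ON ${{ inputs.hyper_params }}

      - name: Upload comparison artifacts
        uses: actions/upload-artifact@v4
        with:
          name: benchmark-artifacts
          path: runs/   # placeholder for wherever TensorBoard data lands
```

      Assuming the file is saved as .github/workflows/benchmark.yml, a run could then be kicked off with something like `gh workflow run benchmark.yml --ref my-feature-branch -f hyper_params="--num-epochs 1"` (the hyperparameter flags here are illustrative).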

      This would reduce the developer time required to set up testing pipelines for each change and run them manually. It would also add consistency to the testing process for this kind of common comparison.

              Assignee: Fynn Schmitt-Ulms (rh-ee-fschmitt)
              Reporter: Fynn Schmitt-Ulms (rh-ee-fschmitt)