  1. Red Hat Enterprise Linux AI
  2. RHELAI-4123

investigate metric logging and experiment tracking tools

    • Type: Epic
    • Resolution: Unresolved
    • InstructLab - Training
    • To Do

      Background:

      • Model training runs generate lots of useful data. We want to track:
        • run configuration and hyperparameters
        • per-process loop metrics in a distributed training environment (single and multi-node)
        • per-process logs
        • per-process and process-group run timings
        • groups of runs as "experiments," distinguishable as projects scoped to different hypotheses or monitoring requirements.
      • We also need to monitor changes in this data over long periods. If a run fails on day 'n', we'd like to know what its dynamics were on day 'n-50'.
      • There are a few well-known solutions in the wild. RHOAI does not currently have an integrated solution, though development is starting. 'Katib' supports hyperparameter sweeps but doesn't solve the tracking problem in the way we need. Existing solutions include:
        • AimStack
        • MLflow
        • Vertex AI
        • WandB
        • ClearML
      • `wandb` has some market share in projects like `axolotl` and `hftrainer`.
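
      To make the tracking requirements above concrete, here is a minimal, tool-agnostic sketch of the data a candidate tool would need to capture. It is not the API of any tool listed above. `RunLogger`, `read_run`, and the `<root>/<experiment>/<run_id>/rank<N>.jsonl` layout are hypothetical names chosen for illustration, and reading the process rank from the `RANK` environment variable assumes a launcher such as torchrun that sets it.

```python
import json
import os
import time
from contextlib import contextmanager
from pathlib import Path


class RunLogger:
    """Hypothetical append-only JSONL logger: one file per process per run."""

    def __init__(self, root, experiment, run_id, rank=None, config=None):
        # Assumption: the launcher exports RANK; fall back to 0 for single-process runs.
        self.rank = int(os.environ.get("RANK", "0")) if rank is None else rank
        # Runs are grouped under an experiment directory.
        self.run_dir = Path(root) / experiment / run_id
        self.run_dir.mkdir(parents=True, exist_ok=True)
        self.path = self.run_dir / f"rank{self.rank}.jsonl"
        if config is not None:
            # Run configuration and hyperparameters.
            self._write("config", {"values": config})

    def _write(self, kind, payload):
        record = {"kind": kind, "rank": self.rank, "time": time.time(), **payload}
        with self.path.open("a", encoding="utf-8") as f:
            f.write(json.dumps(record) + "\n")

    def log_metrics(self, step, **metrics):
        # Per-process training-loop metrics.
        self._write("metrics", {"step": step, "values": metrics})

    def log_message(self, message):
        # Per-process logs.
        self._write("log", {"message": message})

    @contextmanager
    def timed(self, name):
        # Per-process timings; the record is written even if the body raises.
        start = time.perf_counter()
        try:
            yield
        finally:
            self._write("timing", {"name": name, "seconds": time.perf_counter() - start})


def read_run(root, experiment, run_id):
    """Collect the records written by every rank of one run."""
    records = []
    for path in sorted((Path(root) / experiment / run_id).glob("rank*.jsonl")):
        with path.open(encoding="utf-8") as f:
            records.extend(json.loads(line) for line in f)
    return records
```

      Each training process would create its own `RunLogger` and call `log_metrics(step=..., loss=...)` inside the loop. Because records are timestamped and append-only, a run's dynamics stay readable long after it finishes, which is the property the long-term monitoring requirement asks for. A real tool would add storage, querying, and a UI on top of this.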

      Acceptance Criteria:

      The Acceptance Criteria define the scope and the expected outcomes from a user's point of view, and so define the value proposition.

      • Create an executive brief that describes, from our team's POV, which tool strikes a convenient balance between:
        • Ease of use and added velocity.
        • Consumer affinity: could our customers use this the way we would, without special knowledge?
        • OSS friendliness: we don't really want to pay for a SaaS that's incompatible with our values unless the product obviously outcompetes other options.
      • Decide which tool to start consuming for our team's experiment tracking / metric logging.
      • Demo the use of the tool to other teams to build shared workflow understanding.

       

      Open questions:

      Any additional details, questions or decisions that need to be made/addressed

      • Would a RHOAI-focused experiment and metric-tracking solution be standalone? If not, what would a recommended alternative be for customers?

              rh-ee-fschmitt Fynn Schmitt-Ulms
              rhn-support-jkunstle James Kunstle (Inactive)