Background:

We need to track:
- run configuration and hyperparameters
- per-process loop metrics in a distributed training environment (single and multi-node)
- per-process logs
- per-process and process-group run timings
- groups of runs as "experiments," distinguishable as projects scoped to different hypotheses or monitoring requirements
We also need to be able to monitor changes in this data over a long period of time. If a run fails on day 'n', we'd like to know what its dynamics were on day 'n-50'.
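To make the requirements above concrete, here is a minimal sketch of what per-process metric capture could look like with only the Python standard library. It is not a proposed solution and does not reflect any of the tools under evaluation. The class name `RankMetricLogger`, the directory layout, and the file names are hypothetical; the only outside assumption is that the launcher exports `RANK` and `WORLD_SIZE` the way `torchrun` does.

```python
import json
import os
import time
from pathlib import Path


class RankMetricLogger:
    """Hypothetical append-only JSONL metric log, one file per process (rank)."""

    def __init__(self, root, experiment, run_id, config):
        # Assumption: RANK / WORLD_SIZE are set by the launcher (as torchrun does).
        # Fall back to a single-process run when they are absent.
        self.rank = int(os.environ.get("RANK", "0"))
        self.world_size = int(os.environ.get("WORLD_SIZE", "1"))

        # Runs are grouped under an experiment directory (hypothetical layout).
        self.run_dir = Path(root) / experiment / run_id
        self.run_dir.mkdir(parents=True, exist_ok=True)
        self.path = self.run_dir / f"metrics_rank{self.rank}.jsonl"

        # Only rank 0 writes the run configuration and hyperparameters.
        if self.rank == 0:
            (self.run_dir / "config.json").write_text(json.dumps(config, indent=2))

    def log(self, step, **metrics):
        # Each record carries the step, the rank, and a wall-clock timestamp,
        # so loop metrics and timings can be compared across processes later.
        record = {"step": step, "rank": self.rank, "time": time.time(), **metrics}
        with self.path.open("a") as f:
            f.write(json.dumps(record) + "\n")
        return record
```

A training loop would call something like `RankMetricLogger("runs", "lr-sweep", "run-001", {"lr": 3e-4}).log(step=0, loss=2.31, step_seconds=0.42)` on every rank. A dedicated tracking tool would add what this sketch lacks: aggregation across ranks, long-term retention, and a UI for comparing runs.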
There are a few well-known solutions in the wild. RHOAI does not currently have an integrated solution, though development is starting. Katib supports hyperparameter sweeps but doesn't solve the problem in the way we need. Existing solutions include:
- AimStack
- MLflow
- Vertex AI
- WandB
- ClearML
`wandb` has some market share in projects like `axolotl` and `hftrainer`.

Acceptance Criteria:

The Acceptance Criteria define the scope and the expected outcomes from a user's point of view, and state the value proposition.

Create an executive brief that describes, from our team's POV, which tool strikes a convenient balance between:
- Ease of use and added velocity.
- Consumer affinity: could our customers use this the way we would, without special knowledge?
- OSS friendliness: we don't really want to pay for a SaaS that's incompatible with our values unless the product obviously outcompetes other options.
Decide which tool to start consuming for our team's experiment tracking / metric logging.
Demo the use of the tool to other teams to build shared workflow understanding.

Open questions:

Any additional details, questions or decisions that need to be made/addressed

Would a RHOAI-focused experiment and metric-tracking solution be standalone? If not, what would a recommended alternative be for customers?

relates to

RHELAI-3895 enable flexible, portable logging backends (wandb, tensorboard, etc.)