-
Epic
-
Resolution: Unresolved
-
investigate metric logging and experiment tracking tools
-
To Do
Background:
- Model training runs generate lots of useful data. We want to track:
- run configuration and hyperparameters
- per-process training-loop metrics in a distributed training environment (single- and multi-node)
- per-process logs
- per-process and process-group run timings
- groups of runs as "experiments," organized into projects scoped to different hypotheses or monitoring requirements.
- We also need to monitor changes in this data over long periods. If a run fails on day 'n', we'd like to know what its dynamics were on day 'n-50'.
- There are a few well-known solutions in the wild. RHOAI does not currently have an integrated solution, but development is starting. 'Katib' supports hyperparameter sweeps but does not solve the problem in the way we need. Current solutions include:
- AimStack
- MLflow
- Vertex AI
- WandB
- ClearML
- `wandb` has some market share in projects like `axolotl` and `hftrainer`.
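To make the tracking requirements above concrete, here is a minimal, tool-agnostic sketch of the data we would expect any candidate to capture. It is not the API of any tool listed. The `RunLogger` class, the `read_run` helper, and the `<root>/<experiment>/<run_id>/rank<N>.jsonl` layout are hypothetical, and reading the process rank from the `RANK` environment variable assumes a launcher such as `torchrun` that exports it.

```python
import json
import os
import time
from pathlib import Path


class RunLogger:
    """Hypothetical logger: one JSONL metrics file per process, grouped by experiment."""

    def __init__(self, root, experiment, run_id, config, rank=None):
        # Assumption: the launcher exports RANK; fall back to 0 for single-process runs.
        self.rank = int(os.environ.get("RANK", "0")) if rank is None else rank
        self.run_dir = Path(root) / experiment / run_id
        self.run_dir.mkdir(parents=True, exist_ok=True)
        self.path = self.run_dir / f"rank{self.rank}.jsonl"
        self._start = time.monotonic()
        # Only rank 0 writes the shared run configuration and hyperparameters.
        if self.rank == 0:
            (self.run_dir / "config.json").write_text(json.dumps(config, indent=2))

    def log(self, step, **metrics):
        """Append one record with step, rank, and timing information."""
        record = {
            "step": step,
            "rank": self.rank,
            "elapsed_s": time.monotonic() - self._start,  # per-process run timing
            "wall_time": time.time(),  # supports long-horizon comparisons
            **metrics,
        }
        with self.path.open("a") as f:
            f.write(json.dumps(record) + "\n")
        return record


def read_run(run_dir):
    """Load every per-rank record for one run, sorted by (step, rank)."""
    records = []
    for path in sorted(Path(run_dir).glob("rank*.jsonl")):
        with path.open() as f:
            records.extend(json.loads(line) for line in f if line.strip())
    return sorted(records, key=lambda r: (r["step"], r["rank"]))
```

In a training loop, each process would construct a `RunLogger` once and call something like `logger.log(step=i, loss=loss_value)` on each iteration. A real tool would add a UI, remote storage, and log capture on top of this, which is what the evaluation should compare.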
Acceptance Criteria:
The Acceptance Criteria define the scope and the expected outcomes from a user's point of view, and so define the value proposition.
- Create an executive brief that describes, from our team's POV, which tool strikes a convenient balance between:
- Ease of use and added velocity.
- Consumer affinity: could our customers use this the way we would, without special knowledge?
- OSS friendliness: we don't really want to pay for a SaaS that's incompatible with our values unless the product obviously outcompetes other options.
- Decide which tool to start consuming for our team's experiment tracking / metric logging.
- Demo the use of the tool to other teams to build shared workflow understanding.
Open questions:
Any additional details, questions or decisions that need to be made/addressed
- Would a RHOAI-focused experiment and metric-tracking solution be standalone? If not, what would a recommended alternative be for customers?
- relates to: RHELAI-3895 enable flexible, portable logging backends (wandb, tensorboard, etc.) (Resolved)