Spike
Resolution: Unresolved
Goal
The goal of this task is to build a deep understanding of Direct Preference Optimization (DPO), implement it from scratch for a small-scale model, and explore strategies for orchestrating distributed training jobs—particularly multi-node and multi-GPU setups—within the llama-stack framework.
Acceptance Criteria
- Understand the theory behind DPO and how it differs from reward-model-based RLHF (e.g., PPO); the loss sketch after this list illustrates the core objective
- Implement DPO from scratch on a small model and dataset (e.g., DistilGPT2 or LLaMA-1B); a minimal single-pair training step is sketched below
- Study distributed training frameworks (FSDP, DeepSpeed, PyTorch DDP); see the FSDP launch skeleton at the end of this section
- Learn how multi-node/multi-GPU jobs could be orchestrated in llama-stack and write a short proposal outlining your approach
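For the theory item: DPO replaces the reward-model-plus-PPO stage of RLHF with a single supervised objective over preference pairs, minimizing -log sigma(beta * [(log pi_theta(y_w|x) - log pi_ref(y_w|x)) - (log pi_theta(y_l|x) - log pi_ref(y_l|x))]). Below is a minimal sketch of that loss, assuming per-sequence response log-probabilities have already been computed; the function name and argument names are illustrative, not from an existing codebase.

```python
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """DPO loss over a batch of preference pairs.

    Each argument is a 1-D tensor of summed token log-probs for a
    response (chosen or rejected), under the policy or the frozen
    reference model.
    """
    chosen_margin = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_margin = beta * (policy_rejected_logps - ref_rejected_logps)
    # -log sigmoid(delta): pushes the policy to prefer chosen over rejected
    return -F.logsigmoid(chosen_margin - rejected_margin).mean()
```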
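For the from-scratch implementation item: a rough single-preference-pair training step on DistilGPT2 via Hugging Face transformers. The sequence_logprob helper and the toy preference pair are assumptions made for illustration, and splitting the tokenized sequence at the prompt length is an approximation (tokenizers do not always split cleanly at the prompt/response boundary).

```python
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "distilgpt2"  # small model for prototyping
tokenizer = AutoTokenizer.from_pretrained(model_name)
policy = AutoModelForCausalLM.from_pretrained(model_name)
reference = AutoModelForCausalLM.from_pretrained(model_name)
reference.eval()
for p in reference.parameters():
    p.requires_grad_(False)  # reference model stays frozen

def sequence_logprob(model, prompt, response):
    """Sum of log-probs the model assigns to the response tokens, given the prompt."""
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
    full_ids = tokenizer(prompt + response, return_tensors="pt").input_ids
    logits = model(full_ids).logits[:, :-1, :]          # position t predicts token t+1
    targets = full_ids[:, 1:]
    logps = torch.log_softmax(logits, dim=-1).gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    return logps[:, prompt_ids.shape[1] - 1:].sum(-1)   # keep only response positions

optimizer = torch.optim.AdamW(policy.parameters(), lr=1e-5)
beta = 0.1

# Hypothetical preference pair (prompt, chosen, rejected)
prompt, chosen, rejected = "Question: 2+2?\nAnswer:", " 4", " 5"

pi_w = sequence_logprob(policy, prompt, chosen)
pi_l = sequence_logprob(policy, prompt, rejected)
with torch.no_grad():
    ref_w = sequence_logprob(reference, prompt, chosen)
    ref_l = sequence_logprob(reference, prompt, rejected)

loss = -F.logsigmoid(beta * ((pi_w - ref_w) - (pi_l - ref_l))).mean()
loss.backward()
optimizer.step()
optimizer.zero_grad()
```

A real run would batch many pairs from a preference dataset and track the chosen/rejected reward margins as a sanity check, but the step above is the whole core of the algorithm.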
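For the distributed-training item: a minimal FSDP skeleton, assuming the job is launched with torchrun so RANK/LOCAL_RANK/WORLD_SIZE are set by the launcher. The llama-stack orchestration proposal could build on a launcher like this; nothing here uses llama-stack APIs.

```python
import os
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from transformers import AutoModelForCausalLM

def main():
    # torchrun populates RANK, LOCAL_RANK and WORLD_SIZE for every process
    dist.init_process_group("nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    model = AutoModelForCausalLM.from_pretrained("distilgpt2").cuda()
    model = FSDP(model)  # shards params, grads and optimizer state across ranks

    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)
    # ... DPO training loop as in the single-GPU sketch, with a distributed sampler ...

    dist.destroy_process_group()

if __name__ == "__main__":
    # Example multi-node launch (2 nodes x 8 GPUs):
    # torchrun --nnodes=2 --nproc_per_node=8 \
    #   --rdzv_backend=c10d --rdzv_endpoint=<head-node>:29500 train_dpo_fsdp.py
    main()
```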