Spike
Resolution: Unresolved
DPO is the "easiest" preference tuning technique to implement and the most practical for customer use.
DPO tuning requires samples of the shape (prompt, chosen, rejected); an illustrative sample is sketched below. Unlike PPO, it does not require training a separate reward model. Unlike GRPO and PPO, it does not require a parallel inference server for on-policy generation.
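A minimal sketch of what one preference sample looks like; the field names follow the (prompt, chosen, rejected) convention above, and the text values are illustrative placeholders, not real data.

```python
# Illustrative DPO preference sample (placeholder text, not real data).
# Each record pairs one prompt with a preferred and a dispreferred response.
sample = {
    "prompt": "Summarize the purpose of a spike in agile development.",
    "chosen": "A spike is a time-boxed investigation used to reduce uncertainty before committing to an implementation.",
    "rejected": "A spike is when the team works extra hours to finish a sprint.",
}
```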
Goal:
Build an example of DPO tuning using open datasets and the `instructlab/training` library's common utilities. Investigate alternative implementations (TRL, TRL via Axolotl, Torchtune).
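As a starting point for the TRL investigation, here is a minimal DPO training sketch using TRL's `DPOTrainer`. The base model, dataset name, and hyperparameters are illustrative assumptions, and argument names (e.g. `processing_class` vs. the older `tokenizer`) should be verified against the installed TRL version.

```python
# Minimal DPO fine-tuning sketch with TRL (one of the alternatives to investigate).
# Model, dataset, and hyperparameters are placeholders for the spike, not final choices.
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

model_name = "Qwen/Qwen2.5-0.5B-Instruct"  # placeholder base model
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# An open preference dataset with "prompt"/"chosen"/"rejected" columns (assumed format).
train_dataset = load_dataset("trl-lib/ultrafeedback_binarized", split="train")

config = DPOConfig(
    output_dir="dpo-example",
    beta=0.1,                       # strength of the implicit KL penalty toward the reference policy
    per_device_train_batch_size=2,
    num_train_epochs=1,
)

trainer = DPOTrainer(
    model=model,                    # the reference model is created internally when not passed
    args=config,
    train_dataset=train_dataset,
    processing_class=tokenizer,     # named `tokenizer=` in older TRL releases
)
trainer.train()
```

The same (prompt, chosen, rejected) dataset should be reusable across the TRL, Axolotl, and Torchtune comparisons, which keeps the implementations directly comparable.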