Type: Feature
Resolution: Unresolved
Goal:
Enable DPO training via TRL within Llama Stack, so that models can be fine-tuned using preference datasets through the existing post-training workflow.
Acceptance Criteria:
- DPO training can be launched via the post-training API
- The TRL config exposes key DPO hyperparameters (e.g. beta, learning rate, batch size)
- The preference dataset format (prompt, chosen, rejected) is handled correctly
- Checkpoints and training metrics are saved
- Training completes successfully on a single GPU
- A successful test run completes with any model
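To make the third criterion concrete, the sketch below shows the preference-pair record shape DPO training expects. The field names follow TRL's conventional preference-dataset columns; the validator itself is a hypothetical helper for illustration, not part of TRL or the provider.

```python
# Conventional TRL preference-dataset columns.
REQUIRED_FIELDS = ("prompt", "chosen", "rejected")

def validate_preference_record(record: dict) -> bool:
    """Return True if the record has non-empty string values for
    all of prompt/chosen/rejected."""
    return all(
        isinstance(record.get(field), str) and record[field].strip()
        for field in REQUIRED_FIELDS
    )

# One well-formed preference pair (illustrative content).
sample = {
    "prompt": "Explain what DPO training does.",
    "chosen": "DPO fine-tunes a model directly on preference pairs ...",
    "rejected": "DPO is a kind of database optimization.",
}
print(validate_preference_record(sample))            # True
print(validate_preference_record({"prompt": "hi"}))  # False: missing fields
```

A check like this could run when a preference dataset is registered, so malformed rows fail fast rather than mid-training.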
Repo:
llama-stack-provider-trl
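As a rough illustration of the hyperparameters and checkpoint/metrics criteria, here is a minimal single-GPU DPO run using TRL's public `DPOConfig`/`DPOTrainer` API. This is a sketch, not the provider's actual code: the model name, dataset, output path, and all hyperparameter values are placeholder choices, and `processing_class` assumes a recent TRL release (older versions took `tokenizer=` instead).

```python
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

# Placeholder small model so the run fits on a single GPU.
model_name = "Qwen/Qwen2-0.5B-Instruct"
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Preference dataset with prompt/chosen/rejected columns.
train_dataset = load_dataset("trl-lib/ultrafeedback_binarized", split="train")

config = DPOConfig(
    output_dir="./dpo-checkpoints",   # checkpoints land here
    beta=0.1,                         # DPO KL-penalty strength
    learning_rate=5e-7,
    per_device_train_batch_size=2,
    num_train_epochs=1,
    logging_steps=10,                 # training metrics logged at this interval
)

trainer = DPOTrainer(
    model=model,
    args=config,
    train_dataset=train_dataset,
    processing_class=tokenizer,
)
trainer.train()
trainer.save_model()
```

The provider's job would be to build an equivalent `DPOConfig` from the post-training API request and launch the trainer, rather than hard-coding values as above.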