Spike
Resolution: Unresolved
DPO is the "easiest" preference tuning technique to implement and the most practical for customer use.
DPO tuning requires samples of the shape (prompt, chosen, rejected); an illustrative sample is sketched below. Unlike PPO, it does not require training a separate reward model. Unlike GRPO and PPO, it does not require a parallel inference server for on-policy generation.
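A minimal sketch of what one preference sample looks like; the field names follow the (prompt, chosen, rejected) convention above, and the text values are illustrative placeholders, not real data.

```python
# Illustrative DPO preference sample (placeholder text, not real data).
# Each record pairs one prompt with a preferred and a dispreferred response.
sample = {
    "prompt": "Summarize the purpose of a spike in agile development.",
    "chosen": "A spike is a time-boxed investigation used to reduce uncertainty before committing to an implementation.",
    "rejected": "A spike is when the team works extra hours to finish a sprint.",
}
```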
Goal:
Build an example of DPO tuning using open datasets and the `instructlab/training` library's common utilities. Investigate alternative implementations (TRL, TRL via Axolotl, Torchtune).
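As a starting point for the TRL investigation, here is a minimal DPO training sketch using TRL's `DPOTrainer`. The base model, dataset name, and hyperparameters are illustrative assumptions, and argument names (e.g. `processing_class` vs. the older `tokenizer`) should be verified against the installed TRL version.

```python
# Minimal DPO fine-tuning sketch with TRL (one of the alternatives to investigate).
# Model, dataset, and hyperparameters are placeholders for the spike, not final choices.
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

model_name = "Qwen/Qwen2.5-0.5B-Instruct"  # placeholder base model
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# An open preference dataset with "prompt"/"chosen"/"rejected" columns (assumed format).
train_dataset = load_dataset("trl-lib/ultrafeedback_binarized", split="train")

config = DPOConfig(
    output_dir="dpo-example",
    beta=0.1,                       # strength of the implicit KL penalty toward the reference policy
    per_device_train_batch_size=2,
    num_train_epochs=1,
)

trainer = DPOTrainer(
    model=model,                    # the reference model is created internally when not passed
    args=config,
    train_dataset=train_dataset,
    processing_class=tokenizer,     # named `tokenizer=` in older TRL releases
)
trainer.train()
```

The same (prompt, chosen, rejected) dataset should be reusable across the TRL, Axolotl, and Torchtune comparisons, which keeps the implementations directly comparable.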