RHELAI-3913: build DPO example notebook / script


DPO (Direct Preference Optimization) is the "easiest" preference-tuning technique to implement and the most practical for customer use.

DPO tuning requires samples of the shape (prompt, chosen, rejected). Unlike PPO, it does not require a large, separately trained reward model. Unlike GRPO and PPO, it does not require a parallel inference server to generate rollouts during training.
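For concreteness, a single preference sample and the DPO objective look roughly like the sketch below. This is a minimal illustration of the loss from Rafailov et al. (2023), not the `instructlab/training` implementation; the input tensors are assumed to be per-response summed token log-probabilities, and the example strings are hypothetical.

```python
import torch
import torch.nn.functional as F

# One preference sample: a prompt plus a preferred and a dispreferred response.
sample = {
    "prompt": "Explain Direct Preference Optimization in one sentence.",
    "chosen": "DPO fine-tunes a model directly on preference pairs, "
              "sidestepping an explicit reward model.",
    "rejected": "DPO is a database operation.",
}

def dpo_loss(
    policy_chosen_logps: torch.Tensor,    # shape (batch,): summed log-probs of
    policy_rejected_logps: torch.Tensor,  # each response under the policy
    ref_chosen_logps: torch.Tensor,       # same quantities under the frozen
    ref_rejected_logps: torch.Tensor,     # reference model
    beta: float = 0.1,
) -> torch.Tensor:
    """DPO loss (Rafailov et al., 2023): -log sigmoid(beta * reward margin)."""
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Push the implicit reward of the chosen response above the rejected one.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
```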

       

Goal:

Build an example of DPO tuning using open preference datasets and the `instructlab/training` library's common utilities. Investigate alternative implementations (TRL directly, TRL via Axolotl, and Torchtune); a sketch of the TRL route follows below.
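For comparison, a DPO run with TRL's `DPOTrainer` looks roughly like the following. This is a sketch assuming a recent TRL release (argument names have shifted across versions, e.g. `processing_class` was previously `tokenizer`); the model checkpoint and dataset are placeholders, not recommendations.

```python
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

model_name = "Qwen/Qwen2.5-0.5B-Instruct"  # placeholder small model
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Any dataset with "prompt", "chosen", "rejected" columns works;
# trl-lib/ultrafeedback_binarized is one openly available option.
dataset = load_dataset("trl-lib/ultrafeedback_binarized", split="train")

config = DPOConfig(
    output_dir="dpo-example",
    beta=0.1,  # scales the implicit reward; 0.1 is a common default
    per_device_train_batch_size=2,
    num_train_epochs=1,
)

trainer = DPOTrainer(
    model=model,           # if no ref_model is passed, TRL clones the
    args=config,           # policy as the frozen reference
    train_dataset=dataset,
    processing_class=tokenizer,
)
trainer.train()
```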
