News

Introducing Direct Preference Optimization (DPO) Support on OpenPipe

Kyle Corbitt

Oct 1, 2024

4 minutes

We're thrilled to announce that OpenPipe is the first fine-tuning platform to support Direct Preference Optimization (DPO)! This powerful technique lets you align models with your specific requirements even more closely than supervised fine-tuning alone. When used in conjunction with user-defined criteria, DPO is uniquely effective at ensuring model outputs satisfy even complex requirements.

What is DPO?

Direct Preference Optimization (DPO) is an advanced fine-tuning method that lets models learn directly from preference data. Unlike supervised fine-tuning (SFT), which trains only on inputs and target outputs, DPO trains on preference pairs: for each input you provide both a preferred and a rejected output. This lets the model learn nuanced preferences that are difficult to express through supervised examples alone.
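
Under the hood, DPO fine-tunes a model by contrasting each preferred output against its rejected counterpart. For reference, the standard DPO objective from the original paper (Rafailov et al., 2023) is shown below; treat it as the textbook formulation rather than a description of OpenPipe's exact training internals:

    \mathcal{L}_{\mathrm{DPO}}(\pi_\theta;\, \pi_{\mathrm{ref}}) =
      -\,\mathbb{E}_{(x,\, y_w,\, y_l) \sim \mathcal{D}} \left[
        \log \sigma\!\left(
          \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
          \;-\; \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
        \right)
      \right]

Here x is the input, y_w and y_l are the preferred and rejected outputs, \pi_\theta is the model being trained, \pi_{\mathrm{ref}} is a frozen reference model (typically the SFT checkpoint), \beta controls how far the trained model may drift from the reference, and \sigma is the logistic function.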

When to use DPO?

DPO is useful when you have a source of preference data that you can exploit. There are many possible sources of preference data, depending on your use case:

  1. Expert Feedback: you may have a team of experts who can review your model's outputs and edit them where they fall short. The original and edited outputs then serve as the rejected and preferred outputs respectively. DPO can be effective with just a few such preference pairs.

  2. Criteria Feedback: if you use OpenPipe criteria or another evaluation framework that assigns a score to an output based on how well it meets certain criteria, you can run several generations and use the highest and lowest scoring outputs as preferred and non-preferred outputs respectively (a short sketch of this approach follows this list).

  3. User Choice: if you have a chatbot-style interface where users can select their preferred response from a list of generated outputs, you can use the selected and rejected outputs as preference data.

  4. User Regenerations: if a user can regenerate a response multiple times before eventually accepting one, you can use the first output they rejected as the non-preferred output and the accepted output as the preferred output.

  5. User Edits: if your model creates a draft output and the user is able to edit it and then save, you can use the original draft as a non-preferred output and the edited draft as a preferred output.
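
As a concrete illustration of the criteria-feedback approach (item 2 above), here is a minimal Python sketch. The `generate` and `score_output` callables are hypothetical placeholders for your own generation and scoring logic, not part of any OpenPipe API:

    def best_and_worst(prompt, generate, score_output, n_candidates=4):
        """Sample several candidates for one prompt and return the highest-
        and lowest-scoring ones as a (preferred, rejected) pair, or None if
        there is no usable signal.

        Hypothetical callables you supply:
          generate(prompt) -> str                # one sampled completion
          score_output(prompt, output) -> float  # higher = better per your criteria
        """
        candidates = [generate(prompt) for _ in range(n_candidates)]
        # Score every candidate once, then sort by score.
        scored = sorted((score_output(prompt, c), c) for c in candidates)
        (low, worst), (high, best) = scored[0], scored[-1]
        if high == low:
            return None  # every candidate scored the same; skip this prompt
        return best, worst

Run this over a batch of representative prompts and keep the resulting pairs; prompts where all candidates score identically carry no preference signal and are simply skipped.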

Getting Started with DPO on OpenPipe

Our documentation provides detailed guidance on formatting your preference data and configuring DPO training runs. Integrating DPO into your fine-tuning workflow on OpenPipe is straightforward:

  1. Prepare your preference data, including preferred and non-preferred outputs for given inputs (an illustrative format sketch follows this list).

  2. Upload your dataset to OpenPipe.

  3. Select the DPO option when configuring your fine-tuning job.

  4. Launch your training run.
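
To make step 1 concrete, here is a minimal sketch of writing preference pairs out as JSONL, one record per line. The field names below (preferred_output, rejected_output) are an illustrative assumption, not OpenPipe's official schema; check the documentation for the exact format the DPO trainer expects:

    import json

    def write_preference_jsonl(pairs, path="preference_data.jsonl"):
        """Write (prompt, preferred, rejected) triples as one JSON record per line.

        The record shape is illustrative only; consult the OpenPipe docs
        for the exact field names its DPO trainer expects.
        """
        with open(path, "w") as f:
            for prompt, preferred, rejected in pairs:
                record = {
                    "messages": [{"role": "user", "content": prompt}],
                    "preferred_output": {"role": "assistant", "content": preferred},
                    "rejected_output": {"role": "assistant", "content": rejected},
                }
                f.write(json.dumps(record) + "\n")

The pairs here could come straight from a helper like the best_and_worst sketch above, or from any of the other preference sources listed earlier.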

Early Results

Initial tests with DPO on OpenPipe have shown promising results. DPO, when used with user-defined criteria, allows you to fine-tune models that more consistently respect even very nuanced preferences.

Here are some representative results from OpenPipe customer tasks:

  • Word Limit: for a summarization task with an explicit word limit given in the prompt (which varies from prompt to prompt), DPO was able to cut the share of responses exceeding the limit from 31% to 7%, a 77% decrease.

  • Highlight Format: for a content formatting task, DPO was able to drop the percentage of times the wrong word or phrase was highlighted from 17.3% to 1.7%, a 90% decrease.

  • Hallucination: for an information extraction task, DPO was able to drop the fraction of outputs with hallucinated information from 12.7% to 3.0%, a 76% decrease.

  • Result Relevance: for a classification task determining whether a result was relevant to a query, DPO was able to drop the misclassification rate from 4.7% to 1.3%, a 72% decrease.

We're excited to see how you'll leverage DPO to create even more powerful and tailored models for your specific needs!

Sneak Preview: Continual Learning

Right now we support DPO on static datasets, but we're working with a few early customers to integrate DPO into an online learning workflow. This lets them fold user feedback into model updates continuously, which will drive even better model quality over time. If you're interested in working with us to build this for your business, please reach out to hello@openpipe.ai!

Stay tuned for more updates, and happy fine-tuning!

Kyle


About OpenPipe

OpenPipe is the easiest way to train and deploy your own fine-tuned models. It only takes a few minutes to get started, and you can cut your costs by 25x relative to OpenAI while getting higher quality.
