Deriving the Gradient of the DPO Objective

DATE POSTED: August 26, 2024

:::info Authors:

(1) Rafael Rafailov, Stanford University (equal contribution; more junior authors listed earlier);

(2) Archit Sharma, Stanford University (equal contribution; more junior authors listed earlier);

(3) Eric Mitchell, Stanford University (equal contribution; more junior authors listed earlier);

(4) Stefano Ermon, CZ Biohub;

(5) Christopher D. Manning, Stanford University;

(6) Chelsea Finn, Stanford University.

:::

Table of Links

Abstract and 1. Introduction

2 Related Work

3 Preliminaries

4 Direct Preference Optimization

5 Theoretical Analysis of DPO

6 Experiments

7 Discussion, Acknowledgements, and References

Author Contributions

A Mathematical Derivations

A.1 Deriving the Optimum of the KL-Constrained Reward Maximization Objective

A.2 Deriving the DPO Objective Under the Bradley-Terry Model

A.3 Deriving the DPO Objective Under the Plackett-Luce Model

A.4 Deriving the Gradient of the DPO Objective and A.5 Proof of Lemma 1 and 2

A.6 Proof of Theorem 1

B DPO Implementation Details and Hyperparameters

C Further Details on the Experimental Set-Up and C.1 IMDb Sentiment Experiment and Baseline Details

C.2 GPT-4 prompts for computing summarization and dialogue win rates

C.3 Unlikelihood baseline

D Additional Empirical Results

D.1 Performance of Best of N baseline for Various N and D.2 Sample Responses and GPT-4 Judgments

D.3 Human study details

A.4 Deriving the Gradient of the DPO Objective

In this section we derive the gradient of the DPO objective:

$$\nabla_\theta \mathcal{L}_{\text{DPO}}(\pi_\theta; \pi_{\text{ref}}) = -\nabla_\theta\, \mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}}\left[\log \sigma\!\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)}\right)\right]. \tag{21}$$

We can rewrite the RHS of Equation 21 as

$$\nabla_\theta \mathcal{L}_{\text{DPO}}(\pi_\theta; \pi_{\text{ref}}) = -\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}}\left[\frac{\sigma'(u)}{\sigma(u)}\, \nabla_\theta u\right], \qquad u = \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)}.$$

Using the properties of the sigmoid function, $\sigma'(x) = \sigma(x)(1 - \sigma(x))$ and $\sigma(-x) = 1 - \sigma(x)$, we have $\frac{\sigma'(u)}{\sigma(u)} = 1 - \sigma(u) = \sigma(-u)$, and since $\pi_{\text{ref}}$ does not depend on $\theta$, $\nabla_\theta u = \beta\left[\nabla_\theta \log \pi_\theta(y_w \mid x) - \nabla_\theta \log \pi_\theta(y_l \mid x)\right]$. We therefore obtain the final gradient

$$\nabla_\theta \mathcal{L}_{\text{DPO}}(\pi_\theta; \pi_{\text{ref}}) = -\beta\, \mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}}\left[\sigma\!\left(\beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)} - \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)}\right)\left[\nabla_\theta \log \pi_\theta(y_w \mid x) - \nabla_\theta \log \pi_\theta(y_l \mid x)\right]\right].$$
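To make the derivation concrete, here is a minimal numerical sketch in PyTorch (not from the paper; tensor names such as `policy_chosen_logps` and the toy batch values are hypothetical). It implements the loss inside Equation 21 as a batch mean and uses autograd to confirm that the gradient with respect to each $\log \pi_\theta(y_w \mid x)$ carries exactly the $-\beta\,\sigma(\cdot)$ weighting derived above, scaled by $1/B$ for the batch mean.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Batch mean of -log sigma(beta * [(chosen log-ratio) - (rejected log-ratio)])."""
    chosen_logratio = policy_chosen_logps - ref_chosen_logps        # log pi_theta(y_w|x)/pi_ref(y_w|x)
    rejected_logratio = policy_rejected_logps - ref_rejected_logps  # log pi_theta(y_l|x)/pi_ref(y_l|x)
    return -F.logsigmoid(beta * (chosen_logratio - rejected_logratio)).mean()

# Toy batch of B = 4 summed log-probabilities (illustrative values only).
torch.manual_seed(0)
B, beta = 4, 0.1
policy_chosen = torch.randn(B, requires_grad=True)
policy_rejected = torch.randn(B, requires_grad=True)
ref_chosen, ref_rejected = torch.randn(B), torch.randn(B)

loss = dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta)
loss.backward()

# Coefficient predicted by the derivation: -(beta / B) * sigma(-u) per example,
# with the opposite sign for the rejected completion's log-probability.
with torch.no_grad():
    u = beta * ((policy_chosen - ref_chosen) - (policy_rejected - ref_rejected))
    expected = -(beta / B) * torch.sigmoid(-u)
assert torch.allclose(policy_chosen.grad, expected, atol=1e-6)
assert torch.allclose(policy_rejected.grad, -expected, atol=1e-6)
```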

A.5 Proof of Lemma 1 and 2

In this section, we will prove the two lemmas from Section 5.

Lemma 1 Restated. Under the Plackett-Luce preference framework, and in particular the Bradley-Terry framework, two reward functions from the same equivalence class induce the same preference distribution.

Proof. Two reward functions $r$ and $r'$ belong to the same equivalence class if $r'(x, y) = r(x, y) + f(x)$ for some function $f$. We consider the general Plackett-Luce framework (with Bradley-Terry as the special case $K = 2$) and denote by $p_r$ the distribution over rankings induced by the reward $r$. For any prompt $x$, answers $y_1, \ldots, y_K$, and ranking $\tau$ we have

$$p_{r'}(\tau \mid y_1, \ldots, y_K, x) = \prod_{k=1}^{K} \frac{\exp\left(r'(x, y_{\tau(k)})\right)}{\sum_{j=k}^{K} \exp\left(r'(x, y_{\tau(j)})\right)} = \prod_{k=1}^{K} \frac{\exp\left(f(x)\right)\exp\left(r(x, y_{\tau(k)})\right)}{\exp\left(f(x)\right)\sum_{j=k}^{K} \exp\left(r(x, y_{\tau(j)})\right)} = p_{r}(\tau \mid y_1, \ldots, y_K, x),$$

which completes the proof.
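For the Bradley-Terry special case ($K = 2$), the invariance can also be checked numerically. The short sketch below uses made-up reward values (not from the paper): adding the same prompt-dependent shift $f(x)$ to both completions' rewards leaves $p(y_1 \succ y_2 \mid x) = \sigma(r(x, y_1) - r(x, y_2))$ unchanged.

```python
import math

def bt_prob(r_winner, r_loser):
    # Bradley-Terry: p(y_w > y_l | x) = sigma(r(x, y_w) - r(x, y_l))
    return 1.0 / (1.0 + math.exp(-(r_winner - r_loser)))

r_w, r_l, f_x = 2.3, 1.1, 5.0  # illustrative rewards and an arbitrary prompt-dependent shift
assert math.isclose(bt_prob(r_w, r_l), bt_prob(r_w + f_x, r_l + f_x))
```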

Lemma 2 Restated. Two reward functions from the same equivalence class induce the same optimal policy under the constrained RL problem.

Proof. Consider two reward functions from the same equivalence class, so that $r'(x, y) = r(x, y) + f(x)$, and let $\pi_r$ and $\pi_{r'}$ denote the corresponding optimal policies. By the result of Appendix A.1, for all $x, y$ we have

$$\pi_{r'}(y \mid x) = \frac{\pi_{\text{ref}}(y \mid x) \exp\!\left(\tfrac{1}{\beta} r'(x, y)\right)}{\sum_{y'} \pi_{\text{ref}}(y' \mid x) \exp\!\left(\tfrac{1}{\beta} r'(x, y')\right)} = \frac{\exp\!\left(\tfrac{1}{\beta} f(x)\right) \pi_{\text{ref}}(y \mid x) \exp\!\left(\tfrac{1}{\beta} r(x, y)\right)}{\exp\!\left(\tfrac{1}{\beta} f(x)\right) \sum_{y'} \pi_{\text{ref}}(y' \mid x) \exp\!\left(\tfrac{1}{\beta} r(x, y')\right)} = \pi_{r}(y \mid x),$$

which completes the proof.
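The same cancellation can be seen numerically on a toy discrete action space (a sketch with made-up numbers, not part of the paper): because $\exp(f(x)/\beta)$ multiplies both the numerator and the partition function, the policy $\pi(y \mid x) \propto \pi_{\text{ref}}(y \mid x)\exp(r(x, y)/\beta)$ is unchanged when every reward for the prompt is shifted by the same $f(x)$.

```python
import numpy as np

def optimal_policy(ref_probs, rewards, beta):
    # pi(y|x) proportional to pi_ref(y|x) * exp(r(x, y) / beta); the sum plays the role of Z(x).
    weights = ref_probs * np.exp(rewards / beta)
    return weights / weights.sum()

ref_probs = np.array([0.5, 0.3, 0.2])  # pi_ref(y|x) over three completions
rewards = np.array([1.0, -0.5, 2.0])   # illustrative rewards
beta, f_x = 0.5, 1.5                   # arbitrary temperature and prompt-dependent shift

assert np.allclose(optimal_policy(ref_probs, rewards, beta),
                   optimal_policy(ref_probs, rewards + f_x, beta))
```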


:::info This paper is available on arXiv under a CC BY-NC-ND 4.0 DEED license.

:::
