How Prior Studies Have Advanced Value Iteration and Acceleration in Reinforcement Learning

DATE POSTED: January 14, 2025

:::info Authors:

(1) Jongmin Lee, Department of Mathematical Science, Seoul National University;

(2) Ernest K. Ryu, Department of Mathematical Science, Seoul National University and Interdisciplinary Program in Artificial Intelligence, Seoul National University.

:::

Abstract and 1 Introduction

1.1 Notations and preliminaries

1.2 Prior works

2 Anchored Value Iteration

2.1 Accelerated rate for Bellman consistency operator

2.2 Accelerated rate for Bellman optimality operator

3 Convergence when γ = 1

4 Complexity lower bound

5 Approximate Anchored Value Iteration

6 Gauss–Seidel Anchored Value Iteration

7 Conclusion, Acknowledgments and Disclosure of Funding and References

A Preliminaries

B Omitted proofs in Section 2

C Omitted proofs in Section 3

D Omitted proofs in Section 4

E Omitted proofs in Section 5

F Omitted proofs in Section 6

G Broader Impacts

H Limitations

1.2 Prior works

Value Iteration. Value iteration (VI) was first introduced in the DP literature [8] for finding the optimal value function, and its variant, approximate VI [11, 19, 30, 32, 56, 81, 90], considers approximate evaluations of the Bellman optimality operator. In RL, VI and approximate VI have served as the basis of RL algorithms such as fitted value iteration [29, 36, 50, 52, 57, 87] and temporal difference learning [41, 54, 80, 89, 94]. There is a line of research that emulates VI by learning a model of the MDP dynamics [62, 83, 85] and applying a modified Bellman operator [7, 33]. Asynchronous VI, another variant of VI that updates the coordinates of the value function asynchronously, has also been studied in both the RL and DP literature [9, 11, 88, 100].
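\ For concreteness, below is a minimal NumPy sketch of tabular VI, which repeatedly applies the Bellman optimality operator to a value vector. The names `P`, `R`, and `gamma` are illustrative assumptions, not notation taken from the paper.

```python
import numpy as np

def value_iteration(P, R, gamma, num_iters=1000):
    """Minimal sketch of tabular value iteration.

    P: transition tensor of shape (A, S, S), P[a, s, s'] = Pr(s' | s, a)
    R: reward matrix of shape (S, A)
    gamma: discount factor in [0, 1)
    """
    S, A = R.shape
    V = np.zeros(S)
    for _ in range(num_iters):
        # Bellman optimality operator:
        # Q[s, a] = R[s, a] + gamma * sum_{s'} P[a, s, s'] * V[s']
        Q = R + gamma * (P @ V).T
        V = Q.max(axis=1)
    return V
```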


Acceleration. Since Nesterov’s seminal work [61], there has been a large body of research on acceleration in convex minimization. Gradient descent [15] can be accelerated to efficiently reduce the function value and the squared gradient magnitude for smooth convex minimization problems [21, 44–46, 60, 61, 102] and smooth strongly convex minimization problems [59, 64, 73, 86, 91]. Motivated by Nesterov acceleration, inertial fixed-point iterations [22, 42, 51, 70, 75] have also been suggested to accelerate fixed-point iterations. Anderson acceleration [2], another acceleration scheme for fixed-point iterations, has recently been studied with interest [6, 74, 93, 101].
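\ As a point of reference, the following is a minimal sketch of Nesterov's accelerated gradient method for an L-smooth convex objective; `grad_f`, the step size `1/L`, and the FISTA-style momentum schedule are illustrative assumptions, not details taken from the works cited above.

```python
import numpy as np

def nesterov_agd(grad_f, x0, L, num_iters=100):
    """Sketch of Nesterov's accelerated gradient method for an L-smooth
    convex objective, using the standard extrapolation (momentum) step."""
    x, y, t = x0.copy(), x0.copy(), 1.0
    for _ in range(num_iters):
        x_next = y - grad_f(y) / L                       # gradient step from the extrapolated point
        t_next = (1 + np.sqrt(1 + 4 * t ** 2)) / 2       # momentum schedule
        y = x_next + ((t - 1) / t_next) * (x_next - x)   # extrapolation toward the new iterate
        x, t = x_next, t_next
    return x
```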

In DP and RL, prioritized sweeping [55] is a well-known method that changes the order of updates to accelerate convergence, and several variants [3, 18, 53, 68, 95] have been proposed. Speedy Q-learning [4] modifies the update rule of Q-learning and uses aggressive learning rates for acceleration. Recently, there has been a line of research that applies acceleration techniques from other areas to VI: [27, 28, 34, 67, 76, 79] use Anderson acceleration of fixed-point iterations, [1, 12, 37, 38, 92] use Nesterov acceleration of convex optimization, and [31] uses ideas inspired by PID controllers in control theory. Among these works, [1, 37, 38] applied Nesterov acceleration to obtain theoretically accelerated convergence rates, but those analyses require certain reversibility conditions or restrictions on the eigenvalues of the transition probability matrix induced by the policy.

The anchor acceleration, a new acceleration mechanism distinct from Nesterov’s, has lately gained attention in convex optimization and fixed-point theory. The anchoring mechanism, which retracts iterates towards the initial point, has been used to accelerate algorithms for minimax optimization and fixed-point problems [20, 43, 47, 65, 71, 78, 98, 99], and we focus on it in this paper.
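\ To make the mechanism concrete, here is a minimal sketch of a Halpern-style anchored fixed-point iteration, in which each update is a convex combination of the initial point (the anchor) and the operator applied to the current iterate. The 1/(k + 2) anchoring weights are the classical Halpern choice, not necessarily the coefficients used in this paper's Anchored VI.

```python
def anchored_iteration(T, x0, num_iters=100):
    """Halpern-style anchored fixed-point iteration:
    x_{k+1} = beta_k * x0 + (1 - beta_k) * T(x_k), with beta_k -> 0."""
    x = x0
    for k in range(num_iters):
        beta = 1.0 / (k + 2)               # anchoring weight, decays to zero
        x = beta * x0 + (1 - beta) * T(x)  # retract toward the anchor x0
    return x

# Toy usage with a contractive map whose fixed point is 10.0:
print(anchored_iteration(lambda x: 0.9 * x + 1.0, x0=0.0))
```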


:::info This paper is available on arXiv under the CC BY 4.0 DEED license.

:::
