:::info Authors:
(1) Jongmin Lee, Department of Mathematical Sciences, Seoul National University;
(2) Ernest K. Ryu, Department of Mathematical Sciences, Seoul National University, and Interdisciplinary Program in Artificial Intelligence, Seoul National University.
:::
1.1 Notations and preliminaries
2.1 Accelerated rate for Bellman consistency operator
2.2 Accelerated rate for Bellman optimality operator
5 Approximate Anchored Value Iteration
6 Gauss–Seidel Anchored Value Iteration
7 Conclusion, Acknowledgments and Disclosure of Funding and References
1.2 Prior works

Value Iteration. Value iteration (VI) was first introduced in the DP literature [8] for finding the optimal value function, and its variant, approximate VI [11, 19, 30, 32, 56, 81, 90], considers approximate evaluations of the Bellman optimality operator. In RL, VI and approximate VI have served as the basis of algorithms such as fitted value iteration [29, 36, 50, 52, 57, 87] and temporal difference learning [41, 54, 80, 89, 94]. There is a line of research that emulates VI by learning a model of the MDP dynamics [62, 83, 85] and applying a modified Bellman operator [7, 33]. Asynchronous VI, another variant of VI that updates the coordinates of the value function in an asynchronous manner, has also been studied in both the RL and DP literature [9, 11, 88, 100].
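For concreteness, the following is a minimal sketch (not taken from the paper) of tabular VI, which simply applies the Bellman optimality operator repeatedly; the function name and array shapes are our own illustrative choices.

```python
import numpy as np

def value_iteration(P, R, gamma, num_iters=1000):
    """Minimal tabular VI sketch: repeatedly apply the Bellman optimality operator.

    P: transition probabilities, shape (num_actions, num_states, num_states)
    R: rewards, shape (num_actions, num_states)
    gamma: discount factor in [0, 1)
    """
    num_states = P.shape[-1]
    V = np.zeros(num_states)  # initial value function V^0
    for _ in range(num_iters):
        # Bellman optimality operator:
        # (T V)(s) = max_a [ R(s, a) + gamma * sum_s' P(s' | s, a) V(s') ]
        Q = R + gamma * P @ V   # shape (num_actions, num_states)
        V = Q.max(axis=0)       # greedy maximization over actions
    return V
```

Approximate VI, as discussed above, replaces the exact application of this operator with an inexact evaluation at each step.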
\
\ Acceleration. Since Nesterov’s seminal work [61], there has been a large body of research on acceleration in convex minimization. Gradient descent [15] can be accelerated to efficiently reduce the function value and the squared gradient magnitude for smooth convex minimization problems [21, 44–46, 60, 61, 102] and smooth strongly convex minimization problems [59, 64, 73, 86, 91]. Motivated by Nesterov acceleration, inertial fixed-point iterations [22, 42, 51, 70, 75] have also been suggested to accelerate fixed-point iterations. Anderson acceleration [2], another acceleration scheme for fixed-point iterations, has recently been studied with interest [6, 74, 93, 101].
\ In DP and RL, prioritized sweeping [55] is a well-known method that changes the order of updates to accelerate convergence, and several variants [3, 18, 53, 68, 95] have been proposed. Speedy Q-learning [4] modifies the update rule of Q-learning and uses aggressive learning rates for acceleration. Recently, there has been a line of research that applies acceleration techniques from other areas to VI: [27, 28, 34, 67, 76, 79] use Anderson acceleration of fixed-point iterations, [1, 12, 37, 38, 92] use Nesterov acceleration of convex optimization, and [31] uses ideas inspired by PID controllers in control theory. Among these works, [1, 37, 38] applied Nesterov acceleration to obtain theoretically accelerated convergence rates, but those analyses require certain reversibility conditions or restrictions on the eigenvalues of the transition probability matrix induced by the policy.
\ Anchor acceleration, a new acceleration mechanism distinct from Nesterov’s, has lately gained attention in convex optimization and fixed-point theory. The anchoring mechanism, which retracts iterates towards the initial point, has been used to accelerate algorithms for minimax optimization and fixed-point problems [20, 43, 47, 65, 71, 78, 98, 99], and it is the mechanism we focus on in this paper.
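To make the mechanism concrete, here is a minimal sketch, under our own assumptions, of an anchored (Halpern-style) iteration for a generic fixed-point map T; the 1/(k+2) anchoring schedule below is purely illustrative and is not the coefficient choice of the anchored VI algorithms analyzed in this paper.

```python
import numpy as np

def anchored_iteration(T, x0, num_iters=100):
    """Sketch of an anchored (Halpern-style) fixed-point iteration.

    Each step retracts the iterate toward the starting point x0:
        x_{k+1} = beta_k * x0 + (1 - beta_k) * T(x_k)
    The 1/(k+2) schedule is only for illustration; the anchoring
    coefficients used in this paper are chosen differently.
    """
    x0 = np.asarray(x0, dtype=float)
    x = x0.copy()
    for k in range(num_iters):
        beta = 1.0 / (k + 2)                 # anchoring weight, vanishing as k grows
        x = beta * x0 + (1.0 - beta) * T(x)  # retract toward x0, then apply the map T
    return x
```

Taking T to be a Bellman consistency or optimality operator gives an anchored variant of VI of the kind studied in this work, with the anchoring weight controlling how strongly each iterate is pulled back toward the initial value function.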
\
\
:::info This paper is available on arxiv under CC BY 4.0 DEED license.
:::
\