:::info Authors:
(1) Jongmin Lee, Department of Mathematical Science, Seoul National University;
(2) Ernest K. Ryu, Department of Mathematical Science, Seoul National University and Interdisciplinary Program in Artificial Intelligence, Seoul National University.
:::
1.1 Notations and preliminaries
2.1 Accelerated rate for Bellman consistency operator
2.2 Accelerated rate for Bellman optimality operator
5 Approximate Anchored Value Iteration
6 Gauss–Seidel Anchored Value Iteration
7 Conclusion, Acknowledgments and Disclosure of Funding, and References
7 Conclusion
We show that the classical value iteration (VI) is, in fact, suboptimal and that the anchoring mechanism accelerates VI to be optimal in the sense that the accelerated rate matches a complexity lower bound up to a constant factor of 4. We also show that the accelerated iteration provably converges to a fixed point even when γ = 1, if a fixed point exists. Being able to provide a substantive improvement upon the classical VI is, in our view, a surprising contribution.
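\ As a concrete illustration of the anchoring mechanism, the minimal sketch below combines the plain value-iteration update V ← TV with a Halpern-style pull toward the initial point V⁰, i.e., V^{k+1} = β_k V⁰ + (1 − β_k) T V^k. The toy two-state MDP, the `anchored_value_iteration` helper, and the simple 1/(k+2) weight schedule are illustrative assumptions; the paper's Anc-VI uses its own γ-dependent weight schedule, so treat this as a sketch of the idea rather than the authors' exact method.

```python
import numpy as np

def anchored_value_iteration(T, V0, num_iters):
    """Halpern-style anchored fixed-point iteration applied to an operator T.

    The 1/(k+2) anchoring weights below are the generic Halpern schedule;
    Anc-VI in the paper derives a gamma-dependent schedule, so this is an
    illustrative sketch, not the authors' exact method.
    """
    V = V0.copy()
    for k in range(num_iters):
        beta = 1.0 / (k + 2)                  # anchoring weight (assumed schedule)
        V = beta * V0 + (1.0 - beta) * T(V)   # pull each iterate back toward the anchor V0
    return V

# Toy example: Bellman consistency operator of a 2-state MDP under a fixed policy.
P = np.array([[0.9, 0.1],
              [0.2, 0.8]])        # transition probabilities under the policy
r = np.array([1.0, 0.0])          # expected rewards
gamma = 0.99                      # discount factor
T = lambda V: r + gamma * P @ V   # V^pi is the unique fixed point of T

V_hat = anchored_value_iteration(T, np.zeros(2), num_iters=500)
print(V_hat)
```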
\ One direction of future work is to study the empirical effectiveness of Anc-VI. Another direction is to analyze Anc-VI in a model-free setting and, more broadly, to investigate the effectiveness of the anchor mechanism in more practical RL methods.
\ Our results lead us to believe that many of the classical foundations of dynamic programming and reinforcement learning may be improved with a careful examination based on an optimization complexity theory perspective. The theory of optimal optimization algorithms has recently enjoyed significant developments [43–45, 66, 98], the anchoring mechanism being one such example [49, 65], and the classical DP and RL theory may benefit from a similar line of investigation on iteration complexity.
Acknowledgments and Disclosure of Funding
This work was supported by the Information & Communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT) [No. 2021-0-01343, Artificial Intelligence Graduate School Program (Seoul National University)] and the Samsung Science and Technology Foundation (Project Number SSTF-BA2101-02). We thank Jisun Park for providing valuable feedback.
References
[1] M. Akian, S. Gaubert, Z. Qu, and O. Saadi. Multiply accelerated value iteration for nonsymmetric affine fixed point problems and application to Markov decision processes. SIAM Journal on Matrix Analysis and Applications, 43(1):199–232, 2022.
\ [2] D. G. Anderson. Iterative procedures for nonlinear integral equations. Journal of the Association for Computing Machinery, 12(4):547–560, 1965.
\ [3] D. Andre, N. Friedman, and R. Parr. Generalized prioritized sweeping. Neural Information Processing Systems, 1997.
\ [4] M. G. Azar, R. Munos, M. Ghavamzadeh, and H. Kappen. Speedy Q-learning. Neural Information Processing Systems, 2011.
\ [5] S. Banach. Sur les opérations dans les ensembles abstraits et leur application aux équations intégrales. Fundamenta Mathematicae, 3(1):133–181, 1922.
\ [6] M. Barré, A. Taylor, and A. d’Aspremont. Convergence of a constrained vector extrapolation scheme. SIAM Journal on Mathematics of Data Science, 4(3):979–1002, 2022.
\ [7] M. G. Bellemare, G. Ostrovski, A. Guez, P. Thomas, and R. Munos. Increasing the action gap: New operators for reinforcement learning. Association for the Advancement of Artificial Intelligence, 2016.
\ [8] R. Bellman. A Markovian decision process. Journal of Mathematics and Mechanics, 6(5):679–684, 1957.
\ [9] D. Bertsekas and J. Tsitsiklis. Parallel and Distributed Computation: Numerical Methods. Athena Scientific, 2015.
\ [10] D. P. Bertsekas. Dynamic Programming and Optimal Control, volume II. Athena Scientific, 4th edition, 2012.
\ [11] D. P. Bertsekas and J. N. Tsitsiklis. Neuro-Dynamic Programming. Athena Scientific, 1995.
\ [12] W. Bowen, X. Huaqing, Z. Lin, L. Yingbin, and Z. Wei. Finite-time theory for momentum Q-learning. Conference on Uncertainty in Artificial Intelligence, 2021.
\ [13] Y. Carmon, J. C. Duchi, O. Hinder, and A. Sidford. Lower bounds for finding stationary points I. Mathematical Programming, 184(1–2):71–120, 2020.
\ [14] Y. Carmon, J. C. Duchi, O. Hinder, and A. Sidford. Lower bounds for finding stationary points II: first-order methods. Mathematical Programming, 185(1–2):315–355, 2021.
\ [15] A.-L. Cauchy. Méthode générale pour la résolution des systemes d’équations simultanées. Comptes rendus de l’Académie des Sciences, 25:536–538, 1847.
\ [16] V. Colao and G. Marino. On the rate of convergence of Halpern iterations. Journal of Nonlinear and Convex Analysis, 22(12):2639–2646, 2021.
\ [17] J. P. Contreras and R. Cominetti. Optimal error bounds for non-expansive fixed-point iterations in normed spaces. Mathematical Programming, 199(1–2):343–374, 2022.
\ [18] P. Dai, D. S. Weld, J. Goldsmith, et al. Topological value iteration algorithms. Journal of Artificial Intelligence Research, 42:181–209, 2011.
\ [19] D. P. De Farias and B. Van Roy. On the existence of fixed points for approximate value iteration and temporal-difference learning. Journal of Optimization Theory and Applications, 105:589–608, 2000.
\ [20] J. Diakonikolas. Halpern iteration for near-optimal and parameter-free monotone inclusion and strong solutions to variational inequalities. Conference on Learning Theory, 2020.
\ [21] J. Diakonikolas and P. Wang. Potential function-based framework for minimizing gradients in convex and min-max optimization. SIAM Journal on Optimization, 32(3):1668–1697, 2022.
\ [22] Q. Dong, H. Yuan, Y. Cho, and T. M. Rassias. Modified inertial Mann algorithm and inertial CQ-algorithm for nonexpansive mappings. Optimization Letters, 12(1):87–102, 2018.
\ [23] Y. Drori. The exact information-based complexity of smooth convex minimization. Journal of Complexity, 39:1–16, 2017.
\ [24] Y. Drori and O. Shamir. The complexity of finding stationary points with stochastic gradient descent. International Conference on Machine Learning, 2020.
\ [25] Y. Drori and A. Taylor. On the oracle complexity of smooth strongly convex minimization. Journal of Complexity, 68, 2022.
\ [26] Y. Drori and M. Teboulle. An optimal variant of Kelley’s cutting-plane method. Mathematical Programming, 160(1–2):321–351, 2016.
\ [27] M. Ermis, M. Park, and I. Yang. On Anderson acceleration for partially observable Markov decision processes. IEEE Conference on Decision and Control, 2021.
\ [28] M. Ermis and I. Yang. A3DQN: Adaptive Anderson acceleration for deep Q-networks. IEEE Symposium Series on Computational Intelligence, 2020.
\ [29] D. Ernst, P. Geurts, and L. Wehenkel. Tree-based batch mode reinforcement learning. Journal of Machine Learning Research, 6:503–556, 2005.
\ [30] D. Ernst, M. Glavic, P. Geurts, and L. Wehenkel. Approximate value iteration in the reinforcement learning context. Application to electrical power system control. International Journal of Emerging Electric Power Systems, 3(1), 2005.
\ [31] A.-m. Farahmand and M. Ghavamzadeh. PID accelerated value iteration algorithm. International Conference on Machine Learning, 2021.
\ [32] A.-m. Farahmand, C. Szepesvári, and R. Munos. Error propagation for approximate policy and value iteration. Neural Information Processing Systems, 2010.
\ [33] M. Fellows, K. Hartikainen, and S. Whiteson. Bayesian Bellman operators. Neural Information Processing Systems, 2021.
\ [34] M. Geist and B. Scherrer. Anderson acceleration for reinforcement learning. European Workshop on Reinforcement Learning, 2018.
\ [35] N. Golowich, S. Pattathil, C. Daskalakis, and A. Ozdaglar. Last iterate is slower than averaged iterate in smooth convex-concave saddle point problems. Conference on Learning Theory, 2020.
\ [36] G. J. Gordon. Stable function approximation in dynamic programming. International Conference on Machine Learning, 1995.
\ [37] V. Goyal and J. Grand-Clément. A first-order approach to accelerated value iteration. Operations Research, 71(2):517–535, 2022.
\ [38] J. Grand-Clément. From convex optimization to MDPs: A review of first-order, second-order and quasi-Newton methods for MDPs. arXiv:2104.10677, 2021.
\ [39] B. Halpern. Fixed points of nonexpanding maps. Bulletin of the American Mathematical Society, 73(6):957–961, 1967.
\ [40] R. Hannah, Y. Liu, D. O’Connor, and W. Yin. Breaking the span assumption yields fast finite-sum minimization. Neural Information Processing Systems, 2018.
\ [41] M. Hessel, J. Modayil, H. Van Hasselt, T. Schaul, G. Ostrovski, W. Dabney, D. Horgan, B. Piot, M. Azar, and D. Silver. Rainbow: Combining improvements in deep reinforcement learning. Association for the Advancement of Artificial Intelligence, 2018.
\ [42] F. Iutzeler and J. M. Hendrickx. A generic online acceleration scheme for optimization algorithms via relaxation and inertia. Optimization Methods and Software, 34(2):383–405, 2019.
\ [43] D. Kim. Accelerated proximal point method for maximally monotone operators. Mathematical Programming, 190(1–2):57–87, 2021.
\ [44] D. Kim and J. A. Fessler. Optimized first-order methods for smooth convex minimization. Mathematical Programming, 159(1–2):81–107, 2016.
\ [45] D. Kim and J. A. Fessler. Optimizing the efficiency of first-order methods for decreasing the gradient of smooth convex functions. Journal of Optimization Theory and Applications, 188(1):192–219, 2021.
\ [46] J. Lee, C. Park, and E. K. Ryu. A geometric structure of acceleration and its role in making gradients small fast. Neural Information Processing Systems, 2021.
\ [47] S. Lee and D. Kim. Fast extra gradient methods for smooth structured nonconvex-nonconcave minimax problems. Neural Information Processing Systems, 2021.
\ [48] L. Leustean. Rates of asymptotic regularity for Halpern iterations of nonexpansive mappings. Journal of Universal Computer Science, 13(11):1680–1691, 2007.
\ [49] F. Lieder. On the convergence rate of the Halpern-iteration. Optimization Letters, 15(2):405–418, 2021.
\ [50] M. Lutter, S. Mannor, J. Peters, D. Fox, and A. Garg. Value iteration in continuous actions, states and time. International Conference on Machine Learning, 2021.
\ [51] P.-E. Maingé. Convergence theorems for inertial KM-type algorithms. Journal of Computational and Applied Mathematics, 219(1):223–236, 2008.
\ [52] A.-m. Farahmand, M. Ghavamzadeh, C. Szepesvári, and S. Mannor. Regularized fitted Q-iteration for planning in continuous-space Markovian decision problems. American Control Conference, 2009.
\ [53] H. B. McMahan and G. J. Gordon. Fast exact planning in Markov decision processes. International Conference on Automated Planning and Scheduling, 2005.
\ [54] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, S. Petersen, C. Beattie, A. Sadik, I. Antonoglou, H. King, D. Kumaran, D. Wierstra, S. Legg, and D. Hassabis. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, 2015.
\ [55] A. W. Moore and C. G. Atkeson. Prioritized sweeping: Reinforcement learning with less data and less time. Machine Learning, 13:103–130, 1993.
\ [56] R. Munos. Error bounds for approximate value iteration. Association for the Advancement of Artificial Intelligence, 2005.
\ [57] R. Munos and C. Szepesvári. Finite-time bounds for fitted value iteration. Journal of Machine Learning Research, 9(27):815–857, 2008.
\ [58] A. S. Nemirovski. Information-based complexity of linear operator equations. Journal of Complexity, 8(2):153–175, 1992.
\ [59] Y. Nesterov. Lectures on Convex Optimization. Springer, 2nd edition, 2018.
\ [60] Y. Nesterov, A. Gasnikov, S. Guminov, and P. Dvurechensky. Primal–dual accelerated gradient methods with small-dimensional relaxation oracle. Optimization Methods and Software, 36(4):773–810, 2021.
\ [61] Y. E. Nesterov. A method for solving the convex programming problem with convergence rate O(1/k²). Doklady Akademii Nauk SSSR, 269(3):543–547, 1983.
\ [62] S. Niu, S. Chen, H. Guo, C. Targonski, M. Smith, and J. Kovacevi ˇ c. Generalized value itera- ´ tion networks: Life beyond lattices. Association for the Advancement of Artificial Intelligence, 2018.
\ [63] C. Nota and P. Thomas. Is the policy gradient a gradient? International Conference on Autonomous Agents and Multiagent Systems, 2020.
\ [64] C. Park, J. Park, and E. K. Ryu. Factor-√2 acceleration of accelerated gradient methods. Applied Mathematics & Optimization, 2023.
\ [65] J. Park and E. K. Ryu. Exact optimal accelerated complexity for fixed-point iterations. International Conference on Machine Learning, 2022.
\ [66] J. Park and E. K. Ryu. Accelerated infeasibility detection of constrained optimization and fixed-point iterations. International Conference on Machine Learning, 2023.
\ [67] M. Park, J. Shin, and I. Yang. Anderson acceleration for partially observable Markov decision processes: A maximum entropy approach. arXiv:2211.14998, 2022.
\ [68] J. Peng and R. J. Williams. Efficient learning and planning within the Dyna framework. Adaptive Behavior, 1(4):437–454, 1993.
\ [69] M. L. Puterman. Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley and Sons, 1994.
\ [70] S. Reich. Strong convergence theorems for resolvents of accretive operators in Banach spaces. Journal of Mathematical Analysis and Applications, 75(1):287–292, 1980.
\ [71] E. K. Ryu, K. Yuan, and W. Yin. Ode analysis of stochastic gradient methods with optimism and anchoring for minimax problems. arXiv:1905.10899, 2019.
\ [72] S. Sabach and S. Shtern. A first order method for solving convex bilevel optimization problems. SIAM Journal on Optimization, 27(2):640–660, 2017.
\ [73] A. Salim, L. Condat, D. Kovalev, and P. Richtárik. An optimal algorithm for strongly convex minimization under affine constraints. International Conference on Artificial Intelligence and Statistics, 2022.
\ [74] D. Scieur, A. d’Aspremont, and F. Bach. Regularized nonlinear acceleration. Mathematical Programming, 179(1–2):47–83, 2020.
\ [75] Y. Shehu. Convergence rate analysis of inertial Krasnoselskii–Mann type iteration with applications. Numerical Functional Analysis and Optimization, 39(10):1077–1091, 2018.
\ [76] W. Shi, S. Song, H. Wu, Y.-C. Hsu, C. Wu, and G. Huang. Regularized Anderson acceleration for off-policy deep reinforcement learning. Neural Information Processing Systems, 2019.
\ [77] O. Shlakhter, C.-G. Lee, D. Khmelev, and N. Jaber. Acceleration operators in the value iteration algorithms for Markov decision processes. Operations Research, 58(1):193–202, 2010.
\ [78] J. J. Suh, J. Park, and E. K. Ryu. Continuous-time analysis of anchor acceleration. Neural Information Processing Systems, 2023.
\ [79] K. Sun, Y. Wang, Y. Liu, B. Pan, S. Jui, B. Jiang, L. Kong, et al. Damped Anderson mixing for deep reinforcement learning: Acceleration, convergence, and stabilization. Neural Information Processing Systems, 2021.
\ [80] R. S. Sutton. Learning to predict by the methods of temporal differences. Machine Learning, 3:9–44, 1988.
\ [81] R. S. Sutton and A. G. Barto. Reinforcement Learning: An Introduction. MIT Press, 2nd edition, 2018.
\ [82] R. S. Sutton, D. McAllester, S. Singh, and Y. Mansour. Policy gradient methods for reinforcement learning with function approximation. Neural Information Processing Systems, 1999.
\ [83] Q. Sykora, M. Ren, and R. Urtasun. Multi-agent routing value iteration network. International Conference on Machine Learning, 2020.
\ [84] C. Szepesvári. Algorithms for Reinforcement Learning. Springer, 1st edition, 2010.
\ [85] A. Tamar, Y. Wu, G. Thomas, S. Levine, and P. Abbeel. Value iteration networks. Neural Information Processing Systems, 2016.
\ [86] A. Taylor and Y. Drori. An optimal gradient method for smooth strongly convex minimization. Mathematical Programming, 199(1–2):557–594, 2023.
\ [87] S. Tosatto, M. Pirotta, C. d’Eramo, and M. Restelli. Boosted fitted Q-iteration. International Conference on Machine Learning, 2017.
\ [88] J. N. Tsitsiklis. Asynchronous stochastic approximation and Q-learning. Machine Learning, 16:185–202, 1994.
\ [89] H. Van Hasselt, A. Guez, and D. Silver. Deep reinforcement learning with double Q-learning. Association for the Advancement of Artificial Intelligence, 2016.
\ [90] B. Van Roy. Performance loss bounds for approximate value iteration with state aggregation. Mathematics of Operations Research, 31(2):234–244, 2006.
\ [91] B. Van Scoy, R. A. Freeman, and K. M. Lynch. The fastest known globally convergent first-order method for minimizing strongly convex functions. IEEE Control Systems Letters, 2(1):49–54, 2018.
\ [92] N. Vieillard, B. Scherrer, O. Pietquin, and M. Geist. Momentum in reinforcement learning. International Conference on Artificial Intelligence and Statistics, 2020.
\ [93] H. F. Walker and P. Ni. Anderson acceleration for fixed-point iterations. SIAM Journal on Numerical Analysis, 49(4):1715–1735, 2011.
\ [94] C. J. Watkins and P. Dayan. Q-learning. Machine Learning, 8:279–292, 1992.
\ [95] D. Wingate, K. D. Seppi, and S. Mahadevan. Prioritization methods for accelerating MDP solvers. Journal of Machine Learning Research, 6(25):851–881, 2005.
\ [96] R. Wittmann. Approximation of fixed points of nonexpansive mappings. Archiv der Mathematik, 58(5):486–491, 1992.
\ [97] H.-K. Xu. Iterative algorithms for nonlinear operators. Journal of the London Mathematical Society, 66(1):240–256, 2002.
\ [98] T. Yoon and E. K. Ryu. Accelerated algorithms for smooth convex-concave minimax problems with O(1/k²) rate on squared gradient norm. International Conference on Machine Learning, 2021.
\ [99] T. Yoon and E. K. Ryu. Accelerated minimax algorithms flock together. arXiv:2205.11093, 2022.
\ [100] Y. Zeng, F. Feng, and W. Yin. AsyncQVI: Asynchronous-parallel Q-value iteration for discounted Markov decision processes with near-optimal sample complexity. International Conference on Artificial Intelligence and Statistics, 2020.
\ [101] J. Zhang, B. O’Donoghue, and S. Boyd. Globally convergent type-I Anderson acceleration for nonsmooth fixed-point iterations. SIAM Journal on Optimization, 30(4):3170–3197, 2020.
\ [102] K. Zhou, L. Tian, A. M.-C. So, and J. Cheng. Practical schemes for finding near-stationary points of convex finite-sums. International Conference on Artificial Intelligence and Statistics, 2022.
\
:::info This paper is available on arxiv under CC BY 4.0 DEED license.
:::