Dwell on Differential Dynamic Programming (DDP) and Iterative Linear Quadratic Regulator (iLQR)
Although optimal control and reinforcement learning appear to be distinct fields, they are, in fact, closely related. Differential Dynamic Programming (DDP) and the Iterative Linear Quadratic Regulator (iLQR), two powerful algorithms commonly used in trajectory optimization, exemplify how model-based reinforcement learning can bridge the gap between these domains. This blog begins by discussing the foundational principles, including Newton’s method and the Bellman Equation. It then delves into the specifics of the DDP and iLQR algorithms, illustrating their application through the classical problem of double pendulum swing-up control.
Newton’s Method and Line Search
Newton’s Method is an iterative optimization algorithm primarily employed for root finding and unconstrained optimization problems. It is a valuable numerical approach that leverages second-order derivative information, especially when analytical solutions are impractical. Line search is a technique used in optimization to determine a suitable step size, ensuring that each update makes appropriate progress. It is often used in conjunction with gradient-based methods, including Newton’s Method.
Newton’s Method for Root Finding
Updating Equation:
The goal of the root finding problem for a function \(f(x)\) is to find the \(x\) such that \(f(x) = 0\). The equation used for iteratively updating the guess of \(x\) is as follows:
\[x_{n+1} = x_n - \frac{f(x_n)}{f'(x_n)}\]
Derivation:
Start by approximating the function \(f(x)\) with a first-order Taylor expansion at \(x_n\):
\[f(x) \approx f(x_n) + f'(x_n)(x - x_n)\]
Since the goal is to find the root of \(f(x)\), set \(f(x)\) to \(0\) using the approximated expression:
\[0 \approx f(x_n) + f'(x_n)(x_{n+1} - x_n)\]
Solving for \(x_{n+1}\):
\[x_{n+1} = x_n - \frac{f(x_n)}{f'(x_n)}\]
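As a concrete illustration, here is a minimal Python sketch of the root-finding update; the function names and the example \(f(x) = x^2 - 2\) are chosen for illustration, not taken from the original post:

```python
import math

def newton_root(f, fprime, x0, tol=1e-10, max_iter=50):
    """Newton's method for root finding: x_{n+1} = x_n - f(x_n) / f'(x_n)."""
    x = x0
    for _ in range(max_iter):
        fx = f(x)
        if abs(fx) < tol:          # close enough to a root
            return x
        x = x - fx / fprime(x)     # Newton update
    return x

# Find the positive root of f(x) = x^2 - 2, i.e. sqrt(2)
root = newton_root(lambda x: x**2 - 2, lambda x: 2 * x, x0=1.0)
print(root, math.sqrt(2))
```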
Newton’s Method for Optimization
Newton’s method for optimization seeks to find the minimum or maximum of a function \(f(x)\). This involves solving \(\nabla f(x) = 0\), where \(\nabla f(x)\) is the gradient.
Updating Equation:
\[x_{n+1} = x_n - H_f(x_n)^{-1}\nabla f(x_n)\]
Derivation:
Use a second-order Taylor series expansion to approximate \(f(x)\) around \(x_n\):
\[f(x) \approx f(x_n) + \nabla f(x_n)^T(x - x_n) + \frac{1}{2}(x - x_n)^T H_f(x_n)(x - x_n)\]
Take the derivative on both sides. For optimization, set \(\nabla f(x) = 0\) using the approximated expression:
\[\nabla f(x) \approx \nabla f(x_n) + H_f(x_n)(x_{n+1} - x_n)\]
\[0 \approx \nabla f(x_n) + H_f(x_n)(x_{n+1} - x_n)\]
Solving for \(x_{n+1}\):
\[x_{n+1} = x_n - H_f(x_n)^{-1}\nabla f(x_n)\]
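A minimal multi-dimensional sketch of this update in Python, assuming callables for the gradient and Hessian are available; the function names and the quadratic test function are illustrative only:

```python
import numpy as np

def newton_minimize(grad, hess, x0, tol=1e-8, max_iter=50):
    """Newton's method for optimization: x_{n+1} = x_n - H_f(x_n)^{-1} grad f(x_n)."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) < tol:           # gradient ~ 0: stationary point
            return x
        x = x - np.linalg.solve(hess(x), g)   # full Newton step
    return x

# Minimize f(x) = (x_0 - 1)^2 + 10 (x_1 + 2)^2, minimum at (1, -2)
grad = lambda x: np.array([2 * (x[0] - 1), 20 * (x[1] + 2)])
hess = lambda x: np.diag([2.0, 20.0])
print(newton_minimize(grad, hess, x0=[0.0, 0.0]))
```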
Newton’s Method with Line Search
Overshooting can occur when the function is highly nonlinear: if the quadratic approximation suggests a step that is too large, the iterates may diverge from the true optimum. To mitigate this risk, a line search can be incorporated into Newton’s method. The idea is to scale the step along the direction suggested by Newton’s method rather than taking the full step \(- H_f(x_n)^{-1} \nabla f(x_n)\).
- Steps for Incorporating Line Search:
- Direction Determination:
- Compute the Newton step direction: \(p_n = -H_f(x_n)^{-1} \nabla f(x_n)\)
- Line Search:
- Perform a line search along the direction \(p_n\) to find an optimal step size \(\alpha\). The goal is to minimize the function \(f\) along this direction: \(x_{n+1} = x_n + \alpha p_n\)
- The line search involves evaluating \(f(x_n + \alpha p_n)\) for different values of \(\alpha\) (typically between 0 and 1) to find the step size that results in the greatest reduction in the function value.
- Backtracking Line Search is a systematic approach based on the Armijo–Goldstein condition; a minimal sketch combining it with Newton’s method follows this list. Another technique, called control damping and introduced in this note, can be used to achieve the same goal.
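Below is a Python sketch of Newton’s method damped by a backtracking (Armijo) line search; the function name, the constants \(c\) and \(\beta\), and the Rosenbrock test function are illustrative choices, not taken from the original post:

```python
import numpy as np

def newton_backtracking(f, grad, hess, x0, beta=0.5, c=1e-4,
                        tol=1e-8, max_iter=100):
    """Newton's method damped by a backtracking (Armijo) line search."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) < tol:
            return x
        p = -np.linalg.solve(hess(x), g)       # Newton direction p_n
        alpha = 1.0
        # Shrink alpha until the Armijo sufficient-decrease condition holds
        while f(x + alpha * p) > f(x) + c * alpha * g @ p and alpha > 1e-10:
            alpha *= beta
        x = x + alpha * p
    return x

# Rosenbrock function: a standard example where full Newton steps can overshoot
f    = lambda x: (1 - x[0])**2 + 100 * (x[1] - x[0]**2)**2
grad = lambda x: np.array([-2 * (1 - x[0]) - 400 * x[0] * (x[1] - x[0]**2),
                           200 * (x[1] - x[0]**2)])
hess = lambda x: np.array([[2 - 400 * (x[1] - 3 * x[0]**2), -400 * x[0]],
                           [-400 * x[0],                     200.0]])
print(newton_backtracking(f, grad, hess, x0=[-1.2, 1.0]))  # approx [1, 1]
```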
Bellman’s Principle of Optimality
Principle of Optimality: An optimal policy has the property that whatever the initial state and initial decision are, the remaining decisions must constitute an optimal policy with regard to the state resulting from the first decision.
Bellman’s Principle of Optimality is fundamental to both optimal control and reinforcement learning. In the discrete-time setting it takes the form of the Bellman Equation, and in the continuous-time setting it is expressed as the Hamilton-Jacobi-Bellman Equation. Check out this blog post for a more detailed explanation of these concepts. Since both iLQR and DDP are mostly used as direct methods for trajectory optimization, which operate in discrete time, we will focus only on the Bellman Equation here.
Bellman Equation
The iLQR algorithm can be seen as an iterative method to solve the Bellman Equation for nonlinear systems by successively improving a control policy through local approximations of the Q-function and the value function. Let’s break down iLQR in terms of the Bellman Equation and discuss how Newton’s Method is related to the optimization process.
1. Bellman Equation Recap:
The Bellman Equation for the optimal value function \(V^*(x)\) is:
\[V^*(x) = \min_{u} \left[ g(x, u) + V^*(f(x, u)) \right]\]
where:
- \(g(x, u)\) is the immediate cost at state \(x\) with control \(u\).
- \(f(x, u)\) describes the system dynamics, i.e., how the state transitions from \(x\) to the next state \(x'\).
- \(V^*(f(x, u))\) is the value function evaluated at the next state.
The Q-function \(Q(x, u)\) is the cost of taking action \(u\) in state \(x\) and then following the optimal policy thereafter:
\[Q(x, u) = g(x, u) + V^*(f(x, u))\]
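To make the connection to Newton’s Method explicit, here is a sketch of how the Q-function is approximated locally, following the standard iLQR/DDP derivation (the notation \(Q_x, Q_u, Q_{xx}, Q_{ux}, Q_{uu}, k, K\) is introduced here for illustration). Expanding \(Q\) to second order around a nominal point \((\bar{x}, \bar{u})\):
\[Q(\bar{x}+\delta x,\ \bar{u}+\delta u) \approx Q(\bar{x},\bar{u}) + Q_x^T\delta x + Q_u^T\delta u + \frac{1}{2}\delta x^T Q_{xx}\delta x + \delta u^T Q_{ux}\delta x + \frac{1}{2}\delta u^T Q_{uu}\delta u\]
Minimizing this quadratic over \(\delta u\) gives
\[\delta u^* = -Q_{uu}^{-1}(Q_u + Q_{ux}\delta x) = k + K\delta x, \quad k = -Q_{uu}^{-1}Q_u, \quad K = -Q_{uu}^{-1}Q_{ux}\]
which has the same form as the Newton update derived earlier, applied locally to the control at each time step. iLQR builds these \(Q\) derivatives from a first-order (linearized) model of the dynamics, while DDP additionally retains the second-order dynamics terms.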
iLQR Algorithms
DDP Algorithms
Example:
Summary
Root Finding: Uses the first derivative \(f'(x)\), with the update \(x_{n+1} = x_n - \frac{f(x_n)}{f'(x_n)}\).
Optimization: Uses both the gradient \(\nabla f(x)\) and the Hessian \(H_f(x)\) for multi-dimensional problems. In one dimension, the Hessian is simply \(f''(x)\), but in general optimization problems it is the matrix of second-order partial derivatives.
References
- CS 285: Lecture 10, Part 4 (Part of Deep Reinforcement Learning Open Course at UC Berkeley)
- Optimal Control (CMU 16-745) 2023 Lecture 11: Differential Dynamic Programming (Part of optimal control open course from CMU)
- Excellent resources from Cornell: lecture slides & notes
- Iterative linear quadratic regulator (Includes the connection with Pontryagin’s Maximum Principle)
- RL — LQR & iLQR Linear Quadratic Regulator