Shed Some Light on Proximal Policy Optimization (PPO) and Its Application
Proximal Policy Optimization (PPO) is a reinforcement learning algorithm that refines policy gradient methods such as REINFORCE by using importance sampling and a surrogate objective to stabilize updates. It comes in two main variants: PPO-Penalty, which explicitly penalizes the KL divergence in the objective function, and PPO-Clip, which instead clips the probability ratio to prevent large policy updates. In many robotics tasks, PPO is first used to train a base policy (potentially with privileged information); a deployable controller is then learned from this base policy via imitation learning, distillation, or other techniques. This blog explores PPO’s core principles, with code available at repo1 and repo2.
Policy Gradient
Policy Gradient methods are a foundational class of algorithms in reinforcement learning (RL) that directly optimize a parameterized policy with respect to expected cumulative reward. Unlike value-based methods such as Q-learning, which aim to learn value functions and derive policies indirectly, policy gradient methods focus on learning the policy itself, often represented as a neural network that outputs a probability distribution over actions given states. This direct optimization approach is particularly powerful in high-dimensional or continuous action spaces, where traditional methods struggle. By treating the policy as a differentiable function, we can apply gradient ascent techniques to iteratively improve performance through interaction with the environment.
Policy gradients form the backbone of many modern RL algorithms, including Actor-Critic, DDPG, PPO, TRPO, and SAC, offering flexibility and scalability in complex tasks like robotic control, games, and simulated environments. However, policy gradient methods are not without challenges—issues such as high variance in gradient estimates and sample inefficiency are well-known. To understand the core idea behind this family of methods, we begin with the most basic and historically significant example: REINFORCE, a Monte Carlo-based algorithm that illustrates how policy gradients can be estimated using sampled trajectories.
REINFORCE
(Figure: REINFORCE algorithms, from Wikipedia)
The REINFORCE algorithm is an on-policy algorithm. By collecting \(N\) rollout trajectories, it estimates the policy gradient and optimizes the expected return of a parameterized stochastic policy. Let \(\pi_\theta(a \mid s)\) denote a policy that defines a distribution over actions given a state, with parameters \(\theta\). The objective is to maximize the expected cumulative reward:
\[J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta} \left[ R(\tau) \right]\]where \(\tau = (s_0, a_0, s_1, a_1, \ldots)\) denotes a trajectory sampled from the environment following policy \(\pi_\theta\), and \(R(\tau)\) is the total return collected along the trajectory.
Using the log-derivative trick, the gradient of the objective can be expressed as:
\[\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta} \left[ \sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t) \cdot R_t \right]\]where \(R_t = \sum_{k=t}^{T} \gamma^{k - t} r_k\) is the discounted return from timestep \(t\), and \(\gamma \in [0,1]\) is the discount factor.
In practice, this expectation is approximated using \(N\) trajectories:
\[\hat{g} \approx \frac{1}{N} \sum_{n=1}^{N} \sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_{t,n} \mid s_{t,n}) \cdot G_{t,n}\]where \(G_{t,n}\) is the empirical return from timestep \(t\) of the \(n^\text{th}\) trajectory. The policy parameters are then updated via gradient ascent:
\[\theta \leftarrow \theta + \alpha \cdot \hat{g}\]Despite its theoretical elegance and unbiased gradient estimates, REINFORCE suffers from high variance and delayed updates, since gradients can only be computed after full trajectory rollouts. These drawbacks motivate subsequent algorithms such as Actor-Critic and Proximal Policy Optimization (PPO), which improve learning stability through value-function baselines and clipped surrogate objectives.
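To make the update concrete, below is a minimal REINFORCE sketch in PyTorch, assuming a Gymnasium-style CartPole environment; the network sizes, learning rate, and episode count are illustrative choices, not taken from the repositories linked above.

```python
import gymnasium as gym   # assumes the Gymnasium-style reset/step API
import torch
import torch.nn as nn

# Illustrative policy for CartPole: 4-dim state -> probabilities over 2 actions.
policy = nn.Sequential(
    nn.Linear(4, 64), nn.Tanh(),
    nn.Linear(64, 2), nn.Softmax(dim=-1),
)
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)
gamma = 0.99
env = gym.make("CartPole-v1")

def run_episode():
    """Roll out one trajectory under the current policy."""
    obs, _ = env.reset()
    log_probs, rewards, done = [], [], False
    while not done:
        dist = torch.distributions.Categorical(
            policy(torch.as_tensor(obs, dtype=torch.float32)))
        action = dist.sample()
        log_probs.append(dist.log_prob(action))
        obs, reward, terminated, truncated, _ = env.step(action.item())
        rewards.append(reward)
        done = terminated or truncated
    return log_probs, rewards

def reinforce_update(log_probs, rewards):
    """One gradient-ascent step on the Monte Carlo policy-gradient estimate."""
    returns, G = [], 0.0
    for r in reversed(rewards):          # discounted returns G_t, computed backwards
        G = r + gamma * G
        returns.insert(0, G)
    returns = torch.as_tensor(returns, dtype=torch.float32)
    loss = -(torch.stack(log_probs) * returns).sum()  # negate: optimizer minimizes
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

for _ in range(500):                     # episode count is illustrative
    reinforce_update(*run_episode())
```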
Role of Policy Gradient in Modern RL Algorithms
Connection to Optimal Control
KL Divergence and Importance Sampling
KL Divergence
Importance Sampling
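Since the derivations are not reproduced here, it is worth restating the two standard definitions in the notation used above. The KL divergence measures how far the updated policy’s action distribution has drifted from the old one at a given state (written for discrete actions):

\[D_{\mathrm{KL}}\big(\pi_{\theta_\text{old}}(\cdot \mid s) \,\|\, \pi_\theta(\cdot \mid s)\big) = \sum_{a} \pi_{\theta_\text{old}}(a \mid s) \log \frac{\pi_{\theta_\text{old}}(a \mid s)}{\pi_\theta(a \mid s)}\]

Importance sampling lets us estimate expectations under the new policy \(\pi_\theta\) using samples collected by the old policy \(\pi_{\theta_\text{old}}\), which is what allows PPO to reuse rollout data for several gradient steps:

\[\mathbb{E}_{a \sim \pi_\theta}\big[f(a)\big] = \mathbb{E}_{a \sim \pi_{\theta_\text{old}}}\left[\frac{\pi_\theta(a \mid s)}{\pi_{\theta_\text{old}}(a \mid s)}\, f(a)\right]\]

The probability ratio \(r_t(\theta) = \pi_\theta(a_t \mid s_t) / \pi_{\theta_\text{old}}(a_t \mid s_t)\) appearing here is exactly the quantity that PPO-Penalty regularizes (via KL) and PPO-Clip clips.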
Proximal Policy Optimization (PPO)
Pseudocode

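The full pseudocode can be found in the OpenAI Spinning Up reference listed below. As a minimal stand-in, the clipped surrogate objective maximized by PPO-Clip is

\[L^{\mathrm{CLIP}}(\theta) = \mathbb{E}_t\left[\min\Big(r_t(\theta)\,\hat{A}_t,\ \operatorname{clip}\big(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\big)\,\hat{A}_t\Big)\right]\]

where \(\hat{A}_t\) is an advantage estimate. The following PyTorch-style sketch implements the corresponding loss term; the tensor names and default \(\epsilon\) are illustrative assumptions, not taken from the linked repositories.

```python
import torch

def ppo_clip_loss(log_probs_new, log_probs_old, advantages, clip_eps=0.2):
    """Clipped surrogate loss for PPO-Clip (returned as a quantity to minimize).

    log_probs_new: log pi_theta(a_t | s_t) under the current policy
    log_probs_old: log pi_theta_old(a_t | s_t) stored at rollout time (detached)
    advantages:    advantage estimates A_hat_t (e.g. from GAE)
    """
    ratio = torch.exp(log_probs_new - log_probs_old)                     # r_t(theta)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    # Negative sign because optimizers minimize, while PPO maximizes the surrogate.
    return -torch.min(unclipped, clipped).mean()
```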
Example 1: Cartpole Balancing (cartpole-dqn-ddpg-ppo)
Combination with Other Methods
Imitation Learning
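Although the details are beyond the scope of this post, the two-stage recipe mentioned in the introduction is: train a PPO teacher with privileged observations, then fit a deployable student to its behavior. Below is a minimal behavior-cloning sketch of that second step; the teacher/student interfaces and the regression loss are generic illustrative assumptions, not the specific procedure used in hora.

```python
import torch
import torch.nn as nn

def distill_step(student: nn.Module, teacher: nn.Module,
                 obs: torch.Tensor, privileged_obs: torch.Tensor,
                 optimizer: torch.optim.Optimizer) -> float:
    """One behavior-cloning step: the student (deployable observations only)
    imitates the frozen PPO teacher (which also sees privileged observations)."""
    with torch.no_grad():
        teacher_actions = teacher(torch.cat([obs, privileged_obs], dim=-1))
    student_actions = student(obs)
    loss = nn.functional.mse_loss(student_actions, teacher_actions)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```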
Example 2: Used with Adaptation Module Learning (hora)
Summary
- One of the most important reasons that PMP is crucial lies in the fact that maximizing the Hamiltonian is much easier than solving the original infinite-dimensional control problem: it converts a maximization over a function space into a pointwise optimization (see the condition sketched after this list).
- The ARE can be derived from the perspective of PMP, as a special case in which there is no constraint on the control input.
- In contrast to the Hamilton–Jacobi–Bellman equation, which must hold over the entire state space to be valid, PMP is potentially more computationally efficient in that its conditions only need to hold along a particular trajectory.
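To make the pointwise-maximization point concrete, PMP states that along an optimal trajectory the optimal control maximizes the Hamiltonian at each instant (here \(x\) is the state, \(u\) the control, \(\lambda\) the costate, and \(H\) the Hamiltonian; this notation is assumed rather than defined earlier in this post):

\[u^*(t) = \arg\max_{u \in \mathcal{U}} H\big(x^*(t), u, \lambda^*(t), t\big), \qquad \dot{\lambda}^*(t) = -\frac{\partial H}{\partial x}\big(x^*(t), u^*(t), \lambda^*(t), t\big)\]

That is, instead of searching over an entire space of control functions, one only solves a finite-dimensional maximization over \(u\) at each time \(t\) along the trajectory.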
References
- REINFORCE: Reinforcement Learning Most Fundamental Algorithm
- Proximal Policy Optimization (OpenAI Spinning Up)
- Overview of Deep Reinforcement Learning Methods
- CS 182 (UC Berkeley): Lecture 15, Policy Gradients, Parts 1–3
- minimal-isaac-gym