Interpreting LQR through Optimal Control and Reinforcement Learning

Published: May 15, 2026

This post explains the Linear Quadratic Regulator (LQR) from two complementary viewpoints: classical optimal control and reinforcement learning. Starting from the finite- and infinite-horizon optimal control formulation, it derives the Riccati equation and optimal feedback law, then reinterprets the same results through value functions, Q-functions, policy iteration, and value iteration. Drawing on the connection highlighted in Reinforcement Learning and Adaptive Dynamic Programming for Feedback Control, the post shows how LQR serves as a clean bridge between control theory and RL, clarifying how dynamic programming ideas underpin both frameworks.
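
For reference, the objects named above can be written out for the infinite-horizon, discrete-time case. The notation here is an assumption rather than the post's own: $A$, $B$ are the system matrices, $Q \succeq 0$ and $R \succ 0$ are the cost weights, $P$ is the value-function kernel, and $K$ is the feedback gain. The matrix $Q$ (state weight) and the function $Q^*(x,u)$ (action value) are different objects that share a letter by convention.

```latex
\begin{aligned}
x_{k+1} &= A x_k + B u_k, \qquad
J = \sum_{k=0}^{\infty} \left( x_k^\top Q x_k + u_k^\top R u_k \right), \\
V^*(x) &= x^\top P x, \qquad
Q^*(x,u) = x^\top Q x + u^\top R u + (A x + B u)^\top P (A x + B u), \\
P &= Q + A^\top P A - A^\top P B \left( R + B^\top P B \right)^{-1} B^\top P A, \\
u_k &= -K x_k, \qquad K = \left( R + B^\top P B \right)^{-1} B^\top P A .
\end{aligned}
```

The third line is the discrete-time algebraic Riccati equation. Minimizing $Q^*(x,u)$ over $u$ gives the gain in the last line, which is the step that the dynamic-programming and RL readings share.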

Policy Iteration
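
A minimal sketch of policy iteration for discrete-time LQR, assuming dynamics x_{k+1} = A x_k + B u_k, stage cost x'Qx + u'Ru, and a policy u = -K x. The matrices below and the helper names (`evaluate_policy`, `improve_policy`, `policy_iteration`) are hypothetical illustrations, not the post's own code. Policy evaluation solves a Lyapunov equation for the cost matrix of the current gain; policy improvement recomputes the greedy gain. The initial gain must be stabilizing, so the example uses a stable A with K0 = 0.

```python
import numpy as np

def evaluate_policy(A, B, Q, R, K):
    """Solve P = Q + K'RK + (A-BK)' P (A-BK), the cost matrix of u = -K x."""
    A_cl = A - B @ K
    Q_cl = Q + K.T @ R @ K
    n = A.shape[0]
    # Row-major vectorization: vec(A_cl' P A_cl) = kron(A_cl', A_cl') vec(P)
    lhs = np.eye(n * n) - np.kron(A_cl.T, A_cl.T)
    return np.linalg.solve(lhs, Q_cl.reshape(-1)).reshape(n, n)

def improve_policy(A, B, R, P):
    """Greedy gain with respect to the value function x' P x."""
    return np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)

def policy_iteration(A, B, Q, R, K0, tol=1e-10, max_iter=100):
    K = K0
    for _ in range(max_iter):
        P = evaluate_policy(A, B, Q, R, K)
        K_new = improve_policy(A, B, R, P)
        if np.max(np.abs(K_new - K)) < tol:
            return P, K_new
        K = K_new
    return P, K

# Hypothetical example system (open-loop stable, so K0 = 0 is stabilizing).
A = np.array([[0.9, 0.1], [0.0, 0.8]])
B = np.array([[0.0], [0.1]])
Q = np.eye(2)
R = np.array([[1.0]])
K0 = np.zeros((1, 2))

P, K = policy_iteration(A, B, Q, R, K0)
print("P =\n", P)
print("K =", K)
```

At convergence, P should satisfy the Riccati equation and A - BK should have all eigenvalues inside the unit circle.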

Results

Value Iteration
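
A minimal sketch of value iteration for discrete-time LQR, under the same assumed setup (dynamics x_{k+1} = A x_k + B u_k, stage cost x'Qx + u'Ru). The double-integrator matrices and the function names are hypothetical. Each sweep applies the Riccati recursion to the value kernel P, starting from P = 0. Unlike policy iteration, no stabilizing initial gain is needed.

```python
import numpy as np

def riccati_step(A, B, Q, R, P):
    """One Bellman backup on V(x) = x' P x (the Riccati difference equation)."""
    S = R + B.T @ P @ B
    return Q + A.T @ P @ A - A.T @ P @ B @ np.linalg.solve(S, B.T @ P @ A)

def value_iteration(A, B, Q, R, tol=1e-10, max_iter=10000):
    P = np.zeros_like(Q)
    for i in range(max_iter):
        P_next = riccati_step(A, B, Q, R, P)
        if np.max(np.abs(P_next - P)) < tol:
            return P_next, i + 1
        P = P_next
    raise RuntimeError("value iteration did not converge")

# Hypothetical example: double integrator sampled at 0.1 s (open-loop unstable).
A = np.array([[1.0, 0.1], [0.0, 1.0]])
B = np.array([[0.005], [0.1]])
Q = np.eye(2)
R = np.array([[1.0]])

P, n_iter = value_iteration(A, B, Q, R)
K = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)  # greedy gain, u = -K x
print("iterations:", n_iter)
print("K =", K)
```

Value iteration typically needs many more sweeps than policy iteration, but each sweep is cheaper because it avoids solving a Lyapunov equation.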

Results

Model-Free Q-Learning
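
A minimal sketch of one common model-free scheme for LQR: policy iteration on a quadratic Q-function fitted by least squares. This is an assumed illustration and may differ from the method the post uses. The Q-function of a gain K is taken to be Q_K(x, u) = z'Hz with z = [x; u]. H is fitted from sampled transitions via the Bellman equation Q_K(x, u) = cost + Q_K(x', -Kx'), and the improved gain is read off the blocks of H. The matrices A and B appear only inside the hypothetical simulator `step`; the learner sees only states, actions, and costs.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical plant, used only as a black-box simulator below.
A = np.array([[0.9, 0.1], [0.0, 0.8]])
B = np.array([[0.0], [0.1]])
Q = np.eye(2)
R = np.array([[1.0]])
n, m = B.shape

def step(x, u):
    """Simulator: returns the stage cost and the next state."""
    cost = x @ Q @ x + u @ R @ u
    return cost, A @ x + B @ u

def features(z):
    """Quadratic basis such that z' H z = features(z) @ theta,
    where theta holds the upper triangle of the symmetric matrix H."""
    outer = np.outer(z, z)
    i, j = np.triu_indices(len(z))
    scale = np.where(i == j, 1.0, 2.0)  # off-diagonal terms appear twice
    return scale * outer[i, j]

def unpack(theta, d):
    """Rebuild the symmetric matrix H from its upper triangle."""
    H = np.zeros((d, d))
    H[np.triu_indices(d)] = theta
    return H + np.triu(H, 1).T

def evaluate_q(K, n_samples=200):
    """Least-squares fit of H for the policy u = -K x from sampled transitions."""
    rows, targets = [], []
    for _ in range(n_samples):
        x = rng.normal(size=n)
        u = -K @ x + rng.normal(size=m)  # exploratory action
        cost, x_next = step(x, u)
        z = np.concatenate([x, u])
        z_next = np.concatenate([x_next, -K @ x_next])  # on-policy next action
        rows.append(features(z) - features(z_next))
        targets.append(cost)
    theta, *_ = np.linalg.lstsq(np.array(rows), np.array(targets), rcond=None)
    return unpack(theta, n + m)

K = np.zeros((m, n))  # stabilizing initial gain (A is open-loop stable)
for _ in range(20):
    H = evaluate_q(K)
    K = np.linalg.solve(H[n:, n:], H[n:, :n])  # K = H_uu^{-1} H_ux

print("learned K =", K)
```

Because this simulator is noise-free, the least-squares fit is exact up to numerical error, so the learned gain should match the Riccati gain. With process noise, more samples and care with the regression would be needed.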

References
