Dynamic Programming in Reinforcement Learning
A complete treatment of Value Iteration, Policy Iteration, and Bellman equations — built from first principles with interactive simulations.
Motivation & Intuition
Imagine a chess player planning moves, a robot navigating a warehouse, or an algorithm managing a financial portfolio. In each case, an agent must make a sequence of decisions where each choice affects future options and rewards. This is the problem of sequential decision making under uncertainty.
Think of an agent in a maze. At each junction, it must choose a direction. The "value" of being at a junction is not just whether there's treasure right there — it's also about whether good paths lie ahead. Dynamic Programming is the systematic framework for computing these values backwards from the goal.
Planning vs. Learning
There are two broad paradigms in RL:
🗺️ Planning (Model-Based)
You have a complete map of the environment — transition probabilities and rewards. You can compute optimal behavior purely by thinking, without interacting. Dynamic Programming is the canonical planning approach.
🎲 Learning (Model-Free)
You don't know the map. You learn by trial and error — interacting with the environment and updating estimates. TD learning, Q-learning, and Monte Carlo are learning methods. They are inspired by DP but work without a model.
Dynamic Programming is foundational: even if we ultimately use model-free methods, understanding DP deeply clarifies what those methods are approximating.
Formally, we operate in a Markov Decision Process (MDP): a tuple \((\mathcal{S}, \mathcal{A}, P, R, \gamma)\) of states, actions, transition probabilities, rewards, and a discount factor. The Markov property means the future depends only on the current state, not on the history. Dynamic Programming solves two fundamental problems: prediction (computing \(V^\pi\) for a fixed policy \(\pi\)) and control (finding the optimal policy \(\pi^*\) and its value \(V^*\)).
DP exploits the principle of optimality: an optimal policy has the property that, whatever the initial state and action, the remaining decisions must constitute an optimal policy from the resulting state.
Historical Timeline
The development of Dynamic Programming spans seven decades, weaving together control theory, operations research, and artificial intelligence.
Bellman Equations
Intuition: The Recursive Decomposition
The key insight of Bellman is deceptively simple: the value of a state can be decomposed into the immediate reward and the discounted value of the next state. This recursive structure is the engine of all DP algorithms.
"An optimal policy has the property that whatever the initial state and initial decision are, the remaining decisions must constitute an optimal policy with regard to the state resulting from the first decision." — Richard Bellman, 1957
Imagine you are at state \(s\). You take action \(a\), receive reward \(r\), and land in state \(s'\). The total future reward starting from \(s\) is then \(G = r + \gamma G'\), where \(G'\) is the total future reward collected from \(s'\) onward.
Of course, \(s'\) and \(r\) are random. Taking expectations over the randomness of transitions and the stochasticity of the policy gives us the Bellman equations.
Formal Equations
The Bellman expectation equation reads \(V^\pi(s) = \sum_a \pi(a|s) \sum_{s',r} p(s',r|s,a)\bigl[r + \gamma V^\pi(s')\bigr]\). This holds for all \(s \in \mathcal{S}\) simultaneously — it's a system of \(|\mathcal{S}|\) linear equations in \(|\mathcal{S}|\) unknowns. For a fixed policy \(\pi\), this system has a unique solution \(V^\pi\).
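As a concrete illustration, here is a minimal NumPy sketch that solves this linear system directly; the 3-state transition matrix and rewards are made-up values, not from the text.

```python
import numpy as np

# Policy evaluation by direct linear solve: V^pi = (I - gamma * P_pi)^(-1) r_pi.
# P_pi[s, s'] and r_pi[s] are the transition matrix and expected one-step
# reward under the fixed policy pi; the values here are illustrative.
gamma = 0.9
P_pi = np.array([[0.5, 0.5, 0.0],
                 [0.2, 0.3, 0.5],
                 [0.0, 0.0, 1.0]])   # state 2 is absorbing
r_pi = np.array([1.0, 0.0, 0.0])

# V = r_pi + gamma * P_pi V  rearranges to  (I - gamma * P_pi) V = r_pi.
V_pi = np.linalg.solve(np.eye(3) - gamma * P_pi, r_pi)
print(V_pi)   # the unique solution V^pi
```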
Define the Bellman operator \(\mathcal{T}^\pi\), which maps value functions to value functions: \((\mathcal{T}^\pi V)(s) = \sum_a \pi(a|s)\sum_{s',r} p(s',r|s,a)[r+\gamma V(s')]\). It is a contraction under the sup-norm with factor \(\gamma\), and the Bellman equation becomes simply \(V^\pi = \mathcal{T}^\pi V^\pi\).
The Bellman optimality equation characterizes the optimal value function: \(V^*(s) = \max_a \sum_{s',r} p(s',r|s,a)\bigl[r + \gamma V^*(s')\bigr]\). This is now a nonlinear system — the \(\max\) makes it so. Unlike the expectation equation, it cannot be solved by simple linear algebra; iterative methods (Value Iteration) are needed.
The expectation equation is linear in \(V\) (for fixed \(\pi\)). The optimality equation is nonlinear due to the \(\max\). This is why policy evaluation is easier than finding the optimal policy.
The Q-function (action-value function) is preferred in model-free RL because the greedy policy is \(\pi^*(s) = \arg\max_a Q^*(s,a)\) — no model needed to act. This is the bridge from DP to Q-learning.
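Written out, the optimality equation for \(Q^*\) mirrors the one for \(V^*\), with the \(\max\) moved inside the backup:
\[
Q^*(s,a) = \sum_{s',r} p(s',r\,|\,s,a)\bigl[r + \gamma \max_{a'} Q^*(s',a')\bigr], \qquad V^*(s) = \max_a Q^*(s,a).
\]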
The Fixed-Point View
The Bellman equations are fixed-point equations. A fixed point of an operator \(T\) is a point \(V^*\) such that \(TV^* = V^*\). This seemingly circular definition is useful because:
- The Bellman operator is a contraction (under the sup-norm with factor \(\gamma\))
- By the Banach fixed-point theorem, contractions on complete metric spaces have a unique fixed point
- Repeated application of the contraction converges to this unique fixed point
This is exactly what Value Iteration does.
Contraction Mappings
The theoretical reason DP algorithms converge is that the Bellman operators are contractions.
An operator \(T: X \to X\) on a metric space \((X, d)\) is a contraction with factor \(\gamma \in [0,1)\) if: \(\quad d(Tx, Ty) \leq \gamma \cdot d(x, y) \quad \forall x,y \in X\)
For any two value functions \(V, U\) and any state \(s\), using \(\max_a f(a) - \max_a g(a) \leq \max_a [f(a) - g(a)]\): \((\mathcal{T}^* V)(s) - (\mathcal{T}^* U)(s) \leq \max_a \sum_{s',r} p(s',r|s,a)\,\gamma\,[V(s') - U(s')] \leq \gamma \|V - U\|_\infty\).
By symmetry, \((\mathcal{T}^* U)(s) - (\mathcal{T}^* V)(s) \leq \gamma \|V-U\|_\infty\), so \(|(\mathcal{T}^*V)(s) - (\mathcal{T}^*U)(s)| \leq \gamma\|V-U\|_\infty\). Taking sup over \(s\) gives the result. □
By the Banach Fixed-Point Theorem: any contraction on a complete metric space has a unique fixed point, and iteration from any starting point converges to it at a geometric rate \(\gamma^k\).
Watch repeated Bellman updates shrink the error by factor γ each iteration.
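If the interactive demo isn't available, a quick numerical check makes the same point: one application of \(\mathcal{T}^*\) shrinks the sup-norm distance between two arbitrary value functions by at least a factor \(\gamma\). A minimal sketch, where the random \((P, R)\) model is illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions, gamma = 6, 3, 0.9

# Random transition kernel P[a, s, s'] (rows sum to 1) and rewards R[a, s].
P = rng.random((n_actions, n_states, n_states))
P /= P.sum(axis=2, keepdims=True)
R = rng.random((n_actions, n_states))

def bellman_opt(V):
    """Bellman optimality operator: (T*V)(s) = max_a [R(a,s) + gamma * E V(s')]."""
    return np.max(R + gamma * P @ V, axis=0)

# Compare distances before and after one application of the operator.
V = rng.normal(size=n_states)
U = rng.normal(size=n_states)
lhs = np.max(np.abs(bellman_opt(V) - bellman_opt(U)))
rhs = gamma * np.max(np.abs(V - U))
print(f"||T*V - T*U||_inf = {lhs:.4f} <= gamma * ||V - U||_inf = {rhs:.4f}")
```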
Value Iteration
Value Iteration (VI) starts from any initial guess \(V_0\) and repeatedly applies the Bellman optimality operator: \(V_{k+1}(s) = \max_a \sum_{s',r} p(s',r|s,a)\bigl[r + \gamma V_k(s')\bigr]\), i.e., \(V_{k+1} = \mathcal{T}^* V_k\).
Because the Bellman optimality operator is a contraction with factor \(\gamma\), we have \(\|V_k - V^*\|_\infty \leq \gamma^k \|V_0 - V^*\|_\infty\), so the error shrinks geometrically.
Convergence Theorem
For any \(V_0 \in \mathbb{R}^{|\mathcal{S}|}\), the Value Iteration sequence \(\{V_k\}\) satisfies: \(\|V_k - V^*\|_\infty \leq \frac{\gamma^k}{1-\gamma}\|V_1 - V_0\|_\infty\).
We want \(\|V_k - V^*\|_\infty \leq \epsilon\). Using the contraction and the triangle inequality: \(\|V_{k+1} - V^*\|_\infty \leq \|V_{k+1} - V_{k+2}\|_\infty + \|V_{k+2} - V^*\|_\infty \leq \gamma\|V_{k+1} - V_k\|_\infty + \gamma\|V_{k+1} - V^*\|_\infty\), hence \(\|V_{k+1} - V^*\|_\infty \leq \frac{\gamma}{1-\gamma}\|V_{k+1} - V_k\|_\infty\).
So stopping when \(\|V_{k+1}-V_k\|_\infty < \epsilon(1-\gamma)/\gamma\) guarantees error at most \(\epsilon\).
The Algorithm
Initialize \(V(s)\) arbitrarily for all \(s \in \mathcal{S}\) (e.g., \(V(s) = 0\)), with \(V(\text{terminal}) = 0\)
repeat:
\(\Delta \leftarrow 0\)
for each \(s \in \mathcal{S}\):
\(v \leftarrow V(s)\)
\(V(s) \leftarrow \max_a \sum_{s',r} p(s',r|s,a)\left[r + \gamma V(s')\right]\) // Bellman update
\(\Delta \leftarrow \max(\Delta, |v - V(s)|)\)
until \(\Delta < \theta\) // θ is small positive threshold
Output policy: \(\pi^*(s) = \arg\max_a \sum_{s',r} p(s',r|s,a)\left[r + \gamma V(s')\right]\)
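The same loop in code — a minimal NumPy sketch, assuming the model is given as arrays `P[a, s, s']` (transition kernel) and `R[a, s]` (expected reward):

```python
import numpy as np

def value_iteration(P, R, gamma=0.9, theta=1e-8):
    """Tabular Value Iteration.

    P[a, s, s'] -- transition kernel, R[a, s] -- expected immediate reward.
    This (P, R) encoding of the model is an assumption of this sketch.
    """
    n_actions, n_states, _ = P.shape
    V = np.zeros(n_states)
    while True:
        Q = R + gamma * P @ V        # Q[a, s]: one-step lookahead per action
        V_new = Q.max(axis=0)        # Bellman optimality backup at every state
        delta = np.max(np.abs(V_new - V))
        V = V_new
        if delta < theta:            # sweep changed V by less than the threshold
            break
    return V, Q.argmax(axis=0)       # value function and greedy policy
```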
✅ Advantages
• Guaranteed convergence
• Simple to implement
• Anytime: can stop early and extract a greedy policy
• No explicit policy representation needed
⚠️ Disadvantages
• Slow convergence (many iterations)
• Each sweep visits all states
• No intermediate policies extracted
• Requires full model knowledge
Policy Iteration
Policy Iteration alternates between two phases: policy evaluation (computing \(V^\pi\) for the current policy) and policy improvement (updating the policy greedily).
Phase 1: Policy Evaluation
Sweep \(V(s) \leftarrow \sum_a \pi(a|s)\sum_{s',r} p(s',r|s,a)[r + \gamma V(s')]\) over all states, and repeat until convergence to obtain \(V^\pi\). This applies the Bellman expectation operator \(\mathcal{T}^\pi\) repeatedly — also a contraction, so it converges.
Phase 2: Policy Improvement
The improved policy acts greedily: \(\pi'(s) = \arg\max_a Q^\pi(s,a)\), where \(Q^\pi(s,a) = \sum_{s',r} p(s',r|s,a)[r + \gamma V^\pi(s')]\). Policy improvement theorem: if \(\pi'\) is obtained by greedy improvement over \(V^\pi\), then \(V^{\pi'}(s) \geq V^\pi(s)\) for all \(s\). If no improvement is possible, the policy is already optimal: \(\pi = \pi^*\).
The key step: \(V^\pi(s) \leq Q^\pi(s,\pi'(s))\) follows because \(\pi'\) is greedy over \(Q^\pi\), so \(Q^\pi(s,\pi'(s)) = \max_a Q^\pi(s,a) \geq \sum_a \pi(a|s) Q^\pi(s,a) = V^\pi(s)\).
Full Algorithm
repeat:
// Policy Evaluation
repeat:
for each \(s \in \mathcal{S}\):
\(V(s) \leftarrow \sum_a \pi(a|s)\sum_{s',r}p(s',r|s,a)[r+\gamma V(s')]\)
until convergence
// Policy Improvement
policy-stable \(\leftarrow\) true
for each \(s \in \mathcal{S}\):
old \(\leftarrow \pi(s)\)
\(\pi(s) \leftarrow \arg\max_a \sum_{s',r}p(s',r|s,a)[r+\gamma V(s')]\)
if old \(\neq \pi(s)\): policy-stable \(\leftarrow\) false
until policy-stable
Return \(\pi, V\)
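A compact NumPy sketch of the full loop, using an exact linear solve for the evaluation phase; the `P[a, s, s']` / `R[a, s]` model encoding is the same assumption as in the Value Iteration sketch:

```python
import numpy as np

def policy_iteration(P, R, gamma=0.9):
    """Tabular Policy Iteration with exact evaluation via a linear solve.
    P[a, s, s'] and R[a, s] encode the (assumed known) MDP model."""
    n_actions, n_states, _ = P.shape
    policy = np.zeros(n_states, dtype=int)      # arbitrary initial policy
    while True:
        # Policy evaluation: solve (I - gamma * P_pi) V = r_pi exactly.
        P_pi = P[policy, np.arange(n_states)]   # P_pi[s, s'] under pi
        r_pi = R[policy, np.arange(n_states)]
        V = np.linalg.solve(np.eye(n_states) - gamma * P_pi, r_pi)
        # Policy improvement: act greedily with respect to Q^pi.
        Q = R + gamma * P @ V
        new_policy = Q.argmax(axis=0)
        if np.array_equal(new_policy, policy):  # policy stable -> optimal
            return V, policy
        policy = new_policy
```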
Policy Iteration converges in a finite number of iterations: since there are finitely many deterministic policies (\(|\mathcal{A}|^{|\mathcal{S}|}\)) and the policy strictly improves at each step (until optimal), it must terminate.
Value Iteration vs. Policy Iteration
| Aspect | Value Iteration | Policy Iteration |
|---|---|---|
| Core operation | Apply Bellman optimality op. | Evaluate then improve policy |
| Policy evaluation | One Bellman sweep per iteration | Run to full convergence |
| Iterations to convergence | More (geometric rate) | Fewer (finite, often fast) |
| Cost per iteration | Low (\(O(|\mathcal{S}|^2|\mathcal{A}|)\) per sweep) | High (evaluation cost + improvement) |
| Memory | Store \(V\) only | Store \(V\) and \(\pi\) |
| Intermediate policy? | Only at end | Yes, at each iteration |
| Better for | Large state spaces | When eval. is cheap |
In practice, truncated Policy Iteration (running evaluation for a fixed number of steps, not to full convergence) often works better than either extreme. When truncation = 1 sweep, it is exactly Value Iteration.
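A sketch of this middle ground, under the same array-based model assumptions as before — \(m = 1\) reproduces Value Iteration, while large \(m\) approaches full Policy Iteration:

```python
import numpy as np

def truncated_policy_iteration(P, R, gamma=0.9, m=5, n_outer=200):
    """Modified (truncated) Policy Iteration: m evaluation sweeps per
    improvement step, instead of an exact solve."""
    n_actions, n_states, _ = P.shape
    V = np.zeros(n_states)
    for _ in range(n_outer):
        policy = (R + gamma * P @ V).argmax(axis=0)   # greedy improvement
        P_pi = P[policy, np.arange(n_states)]         # model under pi
        r_pi = R[policy, np.arange(n_states)]
        for _ in range(m):                            # truncated evaluation
            V = r_pi + gamma * P_pi @ V               # one expectation sweep
    return V, policy
```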
Generalized Policy Iteration
GPI is the overarching framework encompassing all DP (and most RL) algorithms. The key observation: both evaluation and improvement can be applied partially — the essential property is that they interact in a way that drives the pair \((V, \pi)\) toward \((V^*, \pi^*)\).
The two processes — evaluation and improvement — compete and cooperate, converging to the optimal pair.
In GPI, we never need to run either process to full completion. Even Monte Carlo methods with partial returns, and TD learning with one-step bootstrapping, are instances of GPI. This is the bridge from DP to model-free RL.
Finite vs. Infinite Horizon
Finite Horizon
Episodes terminate at time \(T\). Value depends on time remaining. Solved by backward induction: start at \(V_T = 0\) and work backwards.
Exact solution, but policy is time-dependent.
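Explicitly, backward induction sets \(V_T(s) = 0\) and recurses
\[
V_t(s) = \max_a \sum_{s',r} p(s',r\,|\,s,a)\bigl[r + V_{t+1}(s')\bigr], \qquad t = T-1,\,T-2,\,\ldots,\,0,
\]
and since the maximizing action depends on the time index \(t\), the resulting policy is time-dependent.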
Infinite Horizon (Discounted)
No terminal state. Discount \(\gamma < 1\) ensures finite total reward. Value is stationary (no time index).
Solved by VI or PI. Policy is stationary.
Interactive Gridworld Simulation
Experiment with Value Iteration and Policy Iteration on a 5×5 gridworld. The agent (🤖) starts anywhere and must reach the goal (⭐). Walls block movement; pits give negative reward.
Theoretical Insights
Convergence Rates
Both VI and PI converge, but at different rates:
| Algorithm | Convergence Type | Rate |
|---|---|---|
| Value Iteration | Geometric (linear) | \(\|V_k - V^*\|_\infty \leq \gamma^k \|V_0 - V^*\|_\infty\) |
| Policy Iteration | Finite steps | At most \(|\mathcal{A}|^{|\mathcal{S}|}\) iterations (often much fewer) |
| Policy Iteration (near optimum) | Superlinear | Newton's method applied to the Bellman operator |
Uniqueness of Fixed Point
The Bellman optimality equation \(V = \mathcal{T}^* V\) has a unique solution \(V^*\), and \(V^*(s) = \max_\pi V^\pi(s)\) for all \(s\). There always exists a deterministic stationary optimal policy \(\pi^*\).
For any stochastic policy \(\pi\) and any state \(s\): \(V^\pi(s) = \sum_a \pi(a|s)\, Q^\pi(s,a) \leq \max_a Q^\pi(s,a)\).
The maximum over deterministic actions is at least as large as any convex combination. So the deterministic greedy policy is always at least as good as any stochastic policy.
Policy Iteration and Newton's Method
There is a deep analogy: Policy Iteration is Newton's method applied to the Bellman optimality equation. This explains its superlinear convergence — once near the optimum, it converges very quickly. Value Iteration, by contrast, behaves like a plain first-order fixed-point iteration, converging geometrically at rate \(\gamma\) per step.
Limitations of Dynamic Programming
🌊 Curse of Dimensionality
The state space \(\mathcal{S}\) grows exponentially with the number of state variables. A robot with 10 joints, each with 100 positions, has \(100^{10} = 10^{20}\) states. Storing and updating \(V\) for all states is computationally infeasible.
🗺️ Requires Full Model
DP needs the transition kernel \(p(s',r|s,a)\) exactly. In most real applications (games, robotics, finance), this is unknown. We can only observe samples from the environment.
⏱️ Synchronous Updates
Standard DP updates all states in each sweep. For large state spaces, this is expensive. Asynchronous DP (updating states in arbitrary order) can help but requires care.
🧠 No Generalization
Tabular DP stores one value per state. It cannot generalize across similar states. Function approximation (neural networks) is needed for large/continuous spaces — but this sacrifices convergence guarantees.
State-space size grows exponentially with the number of state dimensions (assuming 10 values per dimension).
Bridge to Model-Free RL
The limitations of DP are precisely the motivations for model-free methods. Here is how DP ideas translate:
| DP Concept | Model-Free Analog | Key Difference |
|---|---|---|
| Policy Evaluation (exact) | MC / TD Prediction | Use samples instead of model |
| Value Iteration | Q-Learning | Sample Bellman updates, no model |
| Policy Iteration | Actor-Critic | Stochastic approx. for both phases |
| Bellman operator \(\mathcal{T}^*\) | Sample backup in TD | Single sample \((s,a,r,s')\) used |
| Synchronous DP sweeps | Experience replay in DQN | Replay buffer ≈ sweeping with samples |
Model-free RL replaces the expectation in the Bellman update with a sample. Instead of \(\sum_{s',r} p(s',r|s,a)[\ldots]\), we use a single observed \((r, s')\). By stochastic approximation theory, the iterates still converge — but now without any model knowledge.
Q-Learning as Sampled Value Iteration
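Take the Value Iteration backup, replace the expectation over \((s', r)\) with a single observed transition, and smooth with a step size \(\alpha\): \(Q(s,a) \leftarrow Q(s,a) + \alpha\bigl[r + \gamma \max_{a'} Q(s',a') - Q(s,a)\bigr]\). A minimal tabular sketch, where the integer state/action encoding is an assumption:

```python
import numpy as np

def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.9):
    """One tabular Q-learning step: a sampled Bellman optimality backup.

    Q is an (n_states, n_actions) array; (s, a, r, s_next) is one observed
    transition -- no model p(s', r | s, a) is needed.
    """
    td_target = r + gamma * np.max(Q[s_next])   # sample stands in for the expectation
    Q[s, a] += alpha * (td_target - Q[s, a])    # move estimate toward the target
    return Q
```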
Summary
Dynamic Programming solves MDPs by exploiting the recursive structure of the Bellman equations. The Bellman operator is a contraction, guaranteeing convergence. Value Iteration applies this operator to find \(V^*\); Policy Iteration alternates evaluation and improvement to converge in finite steps. Both are forms of Generalized Policy Iteration — the unifying framework for all of RL.
| Concept | One-line summary |
|---|---|
| Bellman Expectation | \(V^\pi = \mathcal{T}^\pi V^\pi\): fixed-point equation for policy \(\pi\) |
| Bellman Optimality | \(V^* = \mathcal{T}^* V^*\): nonlinear fixed-point for optimal value |
| Contraction property | Bellman ops shrink distances by \(\gamma\): ensures unique fixed point |
| Value Iteration | Repeatedly apply \(\mathcal{T}^*\); converges geometrically |
| Policy Iteration | Evaluate then improve; converges in finite steps |
| GPI | Any interleaving of evaluation + improvement converges |
| Curse of dimensionality | DP is exact but exponential in state space size |
| Bridge to model-free | Replace expectations with samples → TD, Q-learning, Actor-Critic |
• Sutton & Barto (2018). Reinforcement Learning: An Introduction. Ch. 4.
• Bertsekas (2012). Dynamic Programming and Optimal Control. Vol. 1 & 2.
• Puterman (1994). Markov Decision Processes.
• Bellman (1957). Dynamic Programming. Princeton University Press.