Dynamic Programming in Reinforcement Learning
A complete treatment of Value Iteration, Policy Iteration, and Bellman equations — built from first principles with interactive simulations.
Motivation & Intuition
Imagine a chess player planning moves, a robot navigating a warehouse, or an algorithm managing a financial portfolio. In each case, an agent must make a sequence of decisions where each choice affects future options and rewards. This is the problem of sequential decision making under uncertainty.
Think of an agent in a maze. At each junction, it must choose a direction. The "value" of being at a junction is not just whether there's treasure right there — it's also about whether good paths lie ahead. Dynamic Programming is the systematic framework for computing these values backwards from the goal.
Planning vs. Learning
There are two broad paradigms in RL:
🗺️ Planning (Model-Based)
You have a complete map of the environment — transition probabilities and rewards. You can compute optimal behavior purely by thinking, without interacting. Dynamic Programming is the canonical planning approach.
🎲 Learning (Model-Free)
You don't know the map. You learn by trial and error — interacting with the environment and updating estimates. TD learning, Q-learning, and Monte Carlo are learning methods. They are inspired by DP but work without a model.
Dynamic Programming is foundational: even if we ultimately use model-free methods, understanding DP deeply clarifies what those methods are approximating.
Formally, we operate in a Markov Decision Process (MDP): a tuple \((\mathcal{S}, \mathcal{A}, P, R, \gamma)\) of states, actions, transition probabilities, rewards, and a discount factor. The Markov property means the future depends only on the current state, not on the history. Dynamic Programming solves two fundamental problems: prediction (computing \(V^\pi\) for a fixed policy \(\pi\)) and control (finding the optimal policy \(\pi^*\) and its value \(V^*\)).
DP exploits the principle of optimality: an optimal policy has the property that, whatever the initial state and action, the remaining decisions must constitute an optimal policy from the resulting state.
Historical Timeline
The development of Dynamic Programming spans seven decades, weaving together control theory, operations research, and artificial intelligence.
Bellman Equations
Intuition: The Recursive Decomposition
The key insight of Bellman is deceptively simple: the value of a state can be decomposed into the immediate reward and the discounted value of the next state. This recursive structure is the engine of all DP algorithms.
"An optimal policy has the property that whatever the initial state and initial decision are, the remaining decisions must constitute an optimal policy with regard to the state resulting from the first decision." — Richard Bellman, 1957
Imagine you are at state \(s\). You take action \(a\), receive reward \(r\), and land in state \(s'\). The total future reward starting from \(s\) is then \(G = r + \gamma G'\), where \(G'\) is the total future reward collected from \(s'\) onward.
Of course, \(s'\) and \(r\) are random. Taking expectations over the randomness of transitions and the stochasticity of the policy gives us the Bellman equations.
Formal Equations
The Bellman expectation equation reads \(V^\pi(s) = \sum_a \pi(a|s) \sum_{s',r} p(s',r|s,a)\bigl[r + \gamma V^\pi(s')\bigr]\). This holds for all \(s \in \mathcal{S}\) simultaneously — it's a system of \(|\mathcal{S}|\) linear equations in \(|\mathcal{S}|\) unknowns. For a fixed policy \(\pi\), this system has a unique solution \(V^\pi\).
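As a concrete illustration, here is a minimal NumPy sketch that solves this linear system directly; the 3-state transition matrix and rewards are made-up values, not from the text.

```python
import numpy as np

# Policy evaluation by direct linear solve: V^pi = (I - gamma * P_pi)^(-1) r_pi.
# P_pi[s, s'] and r_pi[s] are the transition matrix and expected one-step
# reward under the fixed policy pi; the values here are illustrative.
gamma = 0.9
P_pi = np.array([[0.5, 0.5, 0.0],
                 [0.2, 0.3, 0.5],
                 [0.0, 0.0, 1.0]])   # state 2 is absorbing
r_pi = np.array([1.0, 0.0, 0.0])

# V = r_pi + gamma * P_pi V  rearranges to  (I - gamma * P_pi) V = r_pi.
V_pi = np.linalg.solve(np.eye(3) - gamma * P_pi, r_pi)
print(V_pi)   # the unique solution V^pi
```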
Define the Bellman operator \(\mathcal{T}^\pi\), which maps value functions to value functions: \((\mathcal{T}^\pi V)(s) = \sum_a \pi(a|s)\sum_{s',r} p(s',r|s,a)[r+\gamma V(s')]\). It is a contraction under the sup-norm with factor \(\gamma\), and the Bellman equation becomes simply \(V^\pi = \mathcal{T}^\pi V^\pi\).
The Bellman optimality equation characterizes the optimal value function: \(V^*(s) = \max_a \sum_{s',r} p(s',r|s,a)\bigl[r + \gamma V^*(s')\bigr]\). This is now a nonlinear system — the \(\max\) makes it so. Unlike the expectation equation, it cannot be solved by simple linear algebra; iterative methods (Value Iteration) are needed.
The expectation equation is linear in \(V\) (for fixed \(\pi\)). The optimality equation is nonlinear due to the \(\max\). This is why policy evaluation is easier than finding the optimal policy.
The Q-function (action-value function) is preferred in model-free RL because the greedy policy is \(\pi^*(s) = \arg\max_a Q^*(s,a)\) — no model needed to act. This is the bridge from DP to Q-learning.
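Written out, the optimality equation for \(Q^*\) mirrors the one for \(V^*\), with the \(\max\) moved inside the backup:
\[
Q^*(s,a) = \sum_{s',r} p(s',r\,|\,s,a)\bigl[r + \gamma \max_{a'} Q^*(s',a')\bigr], \qquad V^*(s) = \max_a Q^*(s,a).
\]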
The Fixed-Point View
The Bellman equations are fixed-point equations. A fixed point of an operator \(T\) is a point \(V^*\) such that \(TV^* = V^*\). This seemingly circular definition is useful because:
- The Bellman operator is a contraction (under the sup-norm with factor \(\gamma\))
- By the Banach fixed-point theorem, contractions on complete metric spaces have a unique fixed point
- Repeated application of the contraction converges to this unique fixed point
This is exactly what Value Iteration does.
Contraction Mappings
The theoretical reason DP algorithms converge is that the Bellman operators are contractions.
An operator \(T: X \to X\) on a metric space \((X, d)\) is a contraction with factor \(\gamma \in [0,1)\) if: \(\quad d(Tx, Ty) \leq \gamma \cdot d(x, y) \quad \forall x,y \in X\)
For any two value functions \(V, U\) and any state \(s\), using \(\max_a f(a) - \max_a g(a) \leq \max_a [f(a) - g(a)]\): \((\mathcal{T}^* V)(s) - (\mathcal{T}^* U)(s) \leq \max_a \sum_{s',r} p(s',r|s,a)\,\gamma\,[V(s') - U(s')] \leq \gamma \|V - U\|_\infty\).
By symmetry, \((\mathcal{T}^* U)(s) - (\mathcal{T}^* V)(s) \leq \gamma \|V-U\|_\infty\), so \(|(\mathcal{T}^*V)(s) - (\mathcal{T}^*U)(s)| \leq \gamma\|V-U\|_\infty\). Taking sup over \(s\) gives the result. □
By the Banach Fixed-Point Theorem: any contraction on a complete metric space has a unique fixed point, and iteration from any starting point converges to it at a geometric rate \(\gamma^k\).
Watch repeated Bellman updates shrink the error by factor γ each iteration.
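If the interactive demo isn't available, a quick numerical check makes the same point: one application of \(\mathcal{T}^*\) shrinks the sup-norm distance between two arbitrary value functions by at least a factor \(\gamma\). A minimal sketch, where the random \((P, R)\) model is illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions, gamma = 6, 3, 0.9

# Random transition kernel P[a, s, s'] (rows sum to 1) and rewards R[a, s].
P = rng.random((n_actions, n_states, n_states))
P /= P.sum(axis=2, keepdims=True)
R = rng.random((n_actions, n_states))

def bellman_opt(V):
    """Bellman optimality operator: (T*V)(s) = max_a [R(a,s) + gamma * E V(s')]."""
    return np.max(R + gamma * P @ V, axis=0)

# Compare distances before and after one application of the operator.
V = rng.normal(size=n_states)
U = rng.normal(size=n_states)
lhs = np.max(np.abs(bellman_opt(V) - bellman_opt(U)))
rhs = gamma * np.max(np.abs(V - U))
print(f"||T*V - T*U||_inf = {lhs:.4f} <= gamma * ||V - U||_inf = {rhs:.4f}")
```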
Value Iteration
Value Iteration (VI) starts from any initial guess \(V_0\) and repeatedly applies the Bellman optimality operator: \(V_{k+1}(s) = \max_a \sum_{s',r} p(s',r|s,a)\bigl[r + \gamma V_k(s')\bigr]\), i.e., \(V_{k+1} = \mathcal{T}^* V_k\).
Because the Bellman optimality operator is a contraction with factor \(\gamma\), we have \(\|V_k - V^*\|_\infty \leq \gamma^k \|V_0 - V^*\|_\infty\), so the error shrinks geometrically.
Convergence Theorem
For any \(V_0 \in \mathbb{R}^{|\mathcal{S}|}\), the Value Iteration sequence \(\{V_k\}\) satisfies: \(\|V_k - V^*\|_\infty \leq \frac{\gamma^k}{1-\gamma}\|V_1 - V_0\|_\infty\).
We want \(\|V_k - V^*\|_\infty \leq \epsilon\). Using the contraction and the triangle inequality: \(\|V_{k+1} - V^*\|_\infty \leq \|V_{k+1} - V_{k+2}\|_\infty + \|V_{k+2} - V^*\|_\infty \leq \gamma\|V_{k+1} - V_k\|_\infty + \gamma\|V_{k+1} - V^*\|_\infty\), hence \(\|V_{k+1} - V^*\|_\infty \leq \frac{\gamma}{1-\gamma}\|V_{k+1} - V_k\|_\infty\).
So stopping when \(\|V_{k+1}-V_k\|_\infty < \epsilon(1-\gamma)/\gamma\) guarantees error at most \(\epsilon\).
The Algorithm
Initialize \(V(s)\) arbitrarily for all \(s \in \mathcal{S}\) (e.g., \(V(s) = 0\)), with \(V(\text{terminal}) = 0\)
repeat:
\(\Delta \leftarrow 0\)
for each \(s \in \mathcal{S}\):
\(v \leftarrow V(s)\)
\(V(s) \leftarrow \max_a \sum_{s',r} p(s',r|s,a)\left[r + \gamma V(s')\right]\) // Bellman update
\(\Delta \leftarrow \max(\Delta, |v - V(s)|)\)
until \(\Delta < \theta\) // θ is small positive threshold
Output policy: \(\pi^*(s) = \arg\max_a \sum_{s',r} p(s',r|s,a)\left[r + \gamma V(s')\right]\)
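The same loop in code — a minimal NumPy sketch, assuming the model is given as arrays `P[a, s, s']` (transition kernel) and `R[a, s]` (expected reward):

```python
import numpy as np

def value_iteration(P, R, gamma=0.9, theta=1e-8):
    """Tabular Value Iteration.

    P[a, s, s'] -- transition kernel, R[a, s] -- expected immediate reward.
    This (P, R) encoding of the model is an assumption of this sketch.
    """
    n_actions, n_states, _ = P.shape
    V = np.zeros(n_states)
    while True:
        Q = R + gamma * P @ V        # Q[a, s]: one-step lookahead per action
        V_new = Q.max(axis=0)        # Bellman optimality backup at every state
        delta = np.max(np.abs(V_new - V))
        V = V_new
        if delta < theta:            # sweep changed V by less than the threshold
            break
    return V, Q.argmax(axis=0)       # value function and greedy policy
```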
✅ Advantages
• Guaranteed convergence
• Simple to implement
• Anytime: can stop early and extract a greedy policy
• No explicit policy representation needed
⚠️ Disadvantages
• Slow convergence (many iterations)
• Each sweep visits all states
• No intermediate policies extracted
• Requires full model knowledge
Policy Iteration
Policy Iteration alternates between two phases: policy evaluation (computing \(V^\pi\) for the current policy) and policy improvement (updating the policy greedily).
Phase 1: Policy Evaluation
Sweep \(V(s) \leftarrow \sum_a \pi(a|s)\sum_{s',r} p(s',r|s,a)[r + \gamma V(s')]\) over all states, and repeat until convergence to obtain \(V^\pi\). This applies the Bellman expectation operator \(\mathcal{T}^\pi\) repeatedly — also a contraction, so it converges.
Phase 2: Policy Improvement
The improved policy acts greedily: \(\pi'(s) = \arg\max_a Q^\pi(s,a)\), where \(Q^\pi(s,a) = \sum_{s',r} p(s',r|s,a)[r + \gamma V^\pi(s')]\). Policy improvement theorem: if \(\pi'\) is obtained by greedy improvement over \(V^\pi\), then \(V^{\pi'}(s) \geq V^\pi(s)\) for all \(s\). If no improvement is possible, the policy is already optimal: \(\pi = \pi^*\).
The key step: \(V^\pi(s) \leq Q^\pi(s,\pi'(s))\) follows because \(\pi'\) is greedy over \(Q^\pi\), so \(Q^\pi(s,\pi'(s)) = \max_a Q^\pi(s,a) \geq \sum_a \pi(a|s) Q^\pi(s,a) = V^\pi(s)\).
Full Algorithm
repeat:
// Policy Evaluation
repeat:
for each \(s \in \mathcal{S}\):
\(V(s) \leftarrow \sum_a \pi(a|s)\sum_{s',r}p(s',r|s,a)[r+\gamma V(s')]\)
until convergence
// Policy Improvement
policy-stable \(\leftarrow\) true
for each \(s \in \mathcal{S}\):
old \(\leftarrow \pi(s)\)
\(\pi(s) \leftarrow \arg\max_a \sum_{s',r}p(s',r|s,a)[r+\gamma V(s')]\)
if old \(\neq \pi(s)\): policy-stable \(\leftarrow\) false
until policy-stable
Return \(\pi, V\)
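A compact NumPy sketch of the full loop, using an exact linear solve for the evaluation phase; the `P[a, s, s']` / `R[a, s]` model encoding is the same assumption as in the Value Iteration sketch:

```python
import numpy as np

def policy_iteration(P, R, gamma=0.9):
    """Tabular Policy Iteration with exact evaluation via a linear solve.
    P[a, s, s'] and R[a, s] encode the (assumed known) MDP model."""
    n_actions, n_states, _ = P.shape
    policy = np.zeros(n_states, dtype=int)      # arbitrary initial policy
    while True:
        # Policy evaluation: solve (I - gamma * P_pi) V = r_pi exactly.
        P_pi = P[policy, np.arange(n_states)]   # P_pi[s, s'] under pi
        r_pi = R[policy, np.arange(n_states)]
        V = np.linalg.solve(np.eye(n_states) - gamma * P_pi, r_pi)
        # Policy improvement: act greedily with respect to Q^pi.
        Q = R + gamma * P @ V
        new_policy = Q.argmax(axis=0)
        if np.array_equal(new_policy, policy):  # policy stable -> optimal
            return V, policy
        policy = new_policy
```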
Policy Iteration converges in a finite number of iterations: since there are finitely many deterministic policies (\(|\mathcal{A}|^{|\mathcal{S}|}\)) and the policy strictly improves at each step (until optimal), it must terminate.
Value Iteration vs. Policy Iteration
| Aspect | Value Iteration | Policy Iteration |
|---|---|---|
| Core operation | Apply Bellman optimality op. | Evaluate then improve policy |
| Policy evaluation | One Bellman sweep per iteration | Run to full convergence |
| Iterations to convergence | More (geometric rate) | Fewer (finite, often fast) |
| Cost per iteration | Low (\(O(|\mathcal{S}|^2|\mathcal{A}|)\) per sweep) | High (evaluation cost + improvement) |
| Memory | Store \(V\) only | Store \(V\) and \(\pi\) |
| Intermediate policy? | Only at end | Yes, at each iteration |
| Better for | Large state spaces | When eval. is cheap |
In practice, truncated Policy Iteration (running evaluation for a fixed number of steps, not to full convergence) often works better than either extreme. When truncation = 1 sweep, it is exactly Value Iteration.
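A sketch of this middle ground, under the same array-based model assumptions as before — \(m = 1\) reproduces Value Iteration, while large \(m\) approaches full Policy Iteration:

```python
import numpy as np

def truncated_policy_iteration(P, R, gamma=0.9, m=5, n_outer=200):
    """Modified (truncated) Policy Iteration: m evaluation sweeps per
    improvement step, instead of an exact solve."""
    n_actions, n_states, _ = P.shape
    V = np.zeros(n_states)
    for _ in range(n_outer):
        policy = (R + gamma * P @ V).argmax(axis=0)   # greedy improvement
        P_pi = P[policy, np.arange(n_states)]         # model under pi
        r_pi = R[policy, np.arange(n_states)]
        for _ in range(m):                            # truncated evaluation
            V = r_pi + gamma * P_pi @ V               # one expectation sweep
    return V, policy
```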
Generalized Policy Iteration
GPI is the overarching framework encompassing all DP (and most RL) algorithms. The key observation: both evaluation and improvement can be applied partially — the essential property is that they interact in a way that drives the pair \((V, \pi)\) toward \((V^*, \pi^*)\).
The two processes — evaluation and improvement — compete and cooperate, converging to the optimal pair.
In GPI, we never need to run either process to full completion. Even Monte Carlo methods with partial returns, and TD learning with one-step bootstrapping, are instances of GPI. This is the bridge from DP to model-free RL.
Finite vs. Infinite Horizon
Finite Horizon
Episodes terminate at time \(T\). Value depends on time remaining. Solved by backward induction: start at \(V_T = 0\) and work backwards.
Exact solution, but policy is time-dependent.
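Explicitly, backward induction sets \(V_T(s) = 0\) and recurses
\[
V_t(s) = \max_a \sum_{s',r} p(s',r\,|\,s,a)\bigl[r + V_{t+1}(s')\bigr], \qquad t = T-1,\,T-2,\,\ldots,\,0,
\]
and since the maximizing action depends on the time index \(t\), the resulting policy is time-dependent.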
Infinite Horizon (Discounted)
No terminal state. Discount \(\gamma < 1\) ensures finite total reward. Value is stationary (no time index).
Solved by VI or PI. Policy is stationary.
Interactive Gridworld Simulation
Experiment with Value Iteration and Policy Iteration on a 5×5 gridworld. The agent (🤖) starts anywhere and must reach the goal (⭐). Walls block movement; pits give negative reward.
Theoretical Insights
Convergence Rates
Both VI and PI converge, but at different rates:
| Algorithm | Convergence Type | Rate |
|---|---|---|
| Value Iteration | Geometric (linear) | \(\|V_k - V^*\|_\infty \leq \gamma^k \|V_0 - V^*\|_\infty\) |
| Policy Iteration | Finite steps | At most \(|\mathcal{A}|^{|\mathcal{S}|}\) iterations (often much fewer) |
| Policy Iteration (near optimum) | Superlinear | Newton's method applied to the Bellman operator |
Uniqueness of Fixed Point
The Bellman optimality equation \(V = \mathcal{T}^* V\) has a unique solution \(V^*\), and \(V^*(s) = \max_\pi V^\pi(s)\) for all \(s\). There always exists a deterministic stationary optimal policy \(\pi^*\).
For any stochastic policy \(\pi\) and any state \(s\): \(V^\pi(s) = \sum_a \pi(a|s)\, Q^\pi(s,a) \leq \max_a Q^\pi(s,a)\).
The maximum over deterministic actions is at least as large as any convex combination. So the deterministic greedy policy is always at least as good as any stochastic policy.
Policy Iteration and Newton's Method
There is a deep analogy: Policy Iteration is Newton's method applied to the Bellman optimality equation. This explains its superlinear convergence — once near the optimum, it converges very quickly. Value Iteration, by contrast, behaves like a plain first-order fixed-point iteration, converging geometrically at rate \(\gamma\) per step.
Limitations of Dynamic Programming
🌊 Curse of Dimensionality
The state space \(\mathcal{S}\) grows exponentially with the number of state variables. A robot with 10 joints, each with 100 positions, has \(100^{10} = 10^{20}\) states. Storing and updating \(V\) for all states is computationally infeasible.
🗺️ Requires Full Model
DP needs the transition kernel \(p(s',r|s,a)\) exactly. In most real applications (games, robotics, finance), this is unknown. We can only observe samples from the environment.
⏱️ Synchronous Updates
Standard DP updates all states in each sweep. For large state spaces, this is expensive. Asynchronous DP (updating states in arbitrary order) can help but requires care.
🧠 No Generalization
Tabular DP stores one value per state. It cannot generalize across similar states. Function approximation (neural networks) is needed for large/continuous spaces — but this sacrifices convergence guarantees.
State-space size grows exponentially with the number of state dimensions (assuming 10 values per dimension).
Bridge to Model-Free RL
The limitations of DP are precisely the motivations for model-free methods. Here is how DP ideas translate:
| DP Concept | Model-Free Analog | Key Difference |
|---|---|---|
| Policy Evaluation (exact) | MC / TD Prediction | Use samples instead of model |
| Value Iteration | Q-Learning | Sample Bellman updates, no model |
| Policy Iteration | Actor-Critic | Stochastic approx. for both phases |
| Bellman operator \(\mathcal{T}^*\) | Sample backup in TD | Single sample \((s,a,r,s')\) used |
| Synchronous DP sweeps | Experience replay in DQN | Replay buffer ≈ sweeping with samples |
Model-free RL replaces the expectation in the Bellman update with a sample. Instead of \(\sum_{s',r} p(s',r|s,a)[\ldots]\), we use a single observed \((r, s')\). By stochastic approximation theory, the iterates still converge — but now without any model knowledge.
Q-Learning as Sampled Value Iteration
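Take the Value Iteration backup, replace the expectation over \((s', r)\) with a single observed transition, and smooth with a step size \(\alpha\): \(Q(s,a) \leftarrow Q(s,a) + \alpha\bigl[r + \gamma \max_{a'} Q(s',a') - Q(s,a)\bigr]\). A minimal tabular sketch, where the integer state/action encoding is an assumption:

```python
import numpy as np

def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.9):
    """One tabular Q-learning step: a sampled Bellman optimality backup.

    Q is an (n_states, n_actions) array; (s, a, r, s_next) is one observed
    transition -- no model p(s', r | s, a) is needed.
    """
    td_target = r + gamma * np.max(Q[s_next])   # sample stands in for the expectation
    Q[s, a] += alpha * (td_target - Q[s, a])    # move estimate toward the target
    return Q
```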
Summary
Dynamic Programming solves MDPs by exploiting the recursive structure of the Bellman equations. The Bellman operator is a contraction, guaranteeing convergence. Value Iteration applies this operator to find \(V^*\); Policy Iteration alternates evaluation and improvement to converge in finite steps. Both are forms of Generalized Policy Iteration — the unifying framework for all of RL.
| Concept | One-line summary |
|---|---|
| Bellman Expectation | \(V^\pi = \mathcal{T}^\pi V^\pi\): fixed-point equation for policy \(\pi\) |
| Bellman Optimality | \(V^* = \mathcal{T}^* V^*\): nonlinear fixed-point for optimal value |
| Contraction property | Bellman ops shrink distances by \(\gamma\): ensures unique fixed point |
| Value Iteration | Repeatedly apply \(\mathcal{T}^*\); converges geometrically |
| Policy Iteration | Evaluate then improve; converges in finite steps |
| GPI | Any interleaving of evaluation + improvement converges |
| Curse of dimensionality | DP is exact but exponential in state space size |
| Bridge to model-free | Replace expectations with samples → TD, Q-learning, Actor-Critic |
• Sutton & Barto (2018). Reinforcement Learning: An Introduction. Ch. 4.
• Bertsekas (2012). Dynamic Programming and Optimal Control. Vol. 1 & 2.
• Puterman (1994). Markov Decision Processes.
• Bellman (1957). Dynamic Programming. Princeton University Press.