Research-Grade Guide · Dynamic Programming

Dynamic Programming
in Reinforcement Learning

A complete treatment of Value Iteration, Policy Iteration, and Bellman equations — built from first principles with interactive simulations.

📖 ~75 min read 🔢 MathJax equations 🎛️ Interactive gridworld 🎓 PhD-level depth
§ 1

Motivation & Intuition

Imagine a chess player planning moves, a robot navigating a warehouse, or an algorithm managing a financial portfolio. In each case, an agent must make a sequence of decisions where each choice affects future options and rewards. This is the problem of sequential decision making under uncertainty.

Core Metaphor

Think of an agent in a maze. At each junction, it must choose a direction. The "value" of being at a junction is not just whether there's treasure right there — it's also about whether good paths lie ahead. Dynamic Programming is the systematic framework for computing these values backwards from the goal.

Planning vs. Learning

There are two broad paradigms in RL:

🗺️ Planning (Model-Based)

You have a complete map of the environment — transition probabilities and rewards. You can compute optimal behavior purely by thinking, without interacting. Dynamic Programming is the canonical planning approach.

🎲 Learning (Model-Free)

You don't know the map. You learn by trial and error — interacting with the environment and updating estimates. TD learning, Q-learning, and Monte Carlo are learning methods. They are inspired by DP but work without a model.

Dynamic Programming is foundational: even if we ultimately use model-free methods, understanding DP deeply clarifies what those methods are approximating.

Formally, we operate in a Markov Decision Process (MDP): a tuple \((\mathcal{S}, \mathcal{A}, P, R, \gamma)\) of states, actions, transition probabilities, rewards, and a discount factor. The Markov property means the future depends only on the current state, not on the history. Dynamic Programming solves two fundamental problems:

Two Core Problems of DP
\[ \textbf{Prediction: } \text{Given policy } \pi, \text{ compute } V^\pi(s) \text{ for all } s \in \mathcal{S} \] \[ \textbf{Control: } \text{Find the optimal policy } \pi^* = \arg\max_\pi V^\pi(s) \text{ for all } s \]

DP exploits the principle of optimality: an optimal policy has the property that, whatever the initial state and action, the remaining decisions must constitute an optimal policy from the resulting state.


§ 2

Historical Timeline

The development of Dynamic Programming spans seven decades, weaving together control theory, operations research, and artificial intelligence.

📜 Interactive History of Dynamic Programming & RL



§ 3

Bellman Equations Foundation

Intuition: The Recursive Decomposition

The key insight of Bellman is deceptively simple: the value of a state can be decomposed into the immediate reward and the discounted value of the next state. This recursive structure is the engine of all DP algorithms.

Bellman's Principle of Optimality

"An optimal policy has the property that whatever the initial state and initial decision are, the remaining decisions must constitute an optimal policy with regard to the state resulting from the first decision." — Richard Bellman, 1957

Imagine you are at state \(s\). You take action \(a\), receive reward \(r\), and land in state \(s'\). The total future reward starting from \(s\) is:

\[ \underbrace{V(s)}_{\text{value of } s} = \underbrace{r}_{\text{immediate}} + \underbrace{\gamma \cdot V(s')}_{\text{discounted future}} \]

Of course, \(s'\) and \(r\) are random. Taking expectations over the randomness of transitions and the stochasticity of the policy gives us the Bellman equations.

Formal Equations

Bellman Expectation Equation for \(V^\pi\)
\[ V^\pi(s) = \sum_{a} \pi(a|s) \sum_{s',r} p(s',r|s,a)\Big[r + \gamma V^\pi(s')\Big] \]

This holds for all \(s \in \mathcal{S}\) simultaneously — it's a system of \(|\mathcal{S}|\) linear equations in \(|\mathcal{S}|\) unknowns. For a fixed policy \(\pi\), this system has a unique solution \(V^\pi\).
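For a small MDP this linear system can be solved directly. A minimal sketch, assuming a toy two-state chain whose induced transition matrix and expected rewards under the fixed policy are as given below:

```python
import numpy as np

gamma = 0.9

# Hypothetical 2-state chain under a fixed policy pi:
# state 1 is absorbing; state 0 pays +1 and moves to state 0 or 1 equally.
P_pi = np.array([[0.5, 0.5],
                 [0.0, 1.0]])   # P_pi[s, s'] = induced transition probabilities
r_pi = np.array([1.0, 0.0])     # expected one-step reward under pi

# Bellman expectation: V = r_pi + gamma * P_pi V  <=>  (I - gamma P_pi) V = r_pi
V = np.linalg.solve(np.eye(2) - gamma * P_pi, r_pi)
print(V)   # V[1] = 0; V[0] = 1 / (1 - 0.45) ≈ 1.818
```

The direct solve costs \(O(|\mathcal{S}|^3)\), which is why iterative evaluation is preferred for larger state spaces.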

Operator Form

Define the Bellman operator \(\mathcal{T}^\pi\) by \(\quad (\mathcal{T}^\pi V)(s) = \sum_a \pi(a|s)\sum_{s',r} p(s',r|s,a)[r+\gamma V(s')]\). It maps value functions to value functions and is a contraction under the sup-norm with factor \(\gamma\). The Bellman equation is then simply: \(V^\pi = \mathcal{T}^\pi V^\pi\).

Bellman Optimality Equation for \(V^*\)
\[ V^*(s) = \max_{a \in \mathcal{A}} \sum_{s',r} p(s',r|s,a)\Big[r + \gamma V^*(s')\Big] \]

This is now a nonlinear system — the \(\max\) makes it so. Unlike the expectation equation, this cannot be solved by simple linear algebra. Iterative methods (Value Iteration) are needed.

Key Difference

The expectation equation is linear in \(V\) (for fixed \(\pi\)). The optimality equation is nonlinear due to the \(\max\). This is why policy evaluation is easier than finding the optimal policy.

Bellman Optimality for \(Q^*\)
\[ Q^*(s,a) = \sum_{s',r} p(s',r|s,a)\Big[r + \gamma \max_{a'} Q^*(s',a')\Big] \]

The Q-function (action-value function) is preferred in model-free RL because the greedy policy is \(\pi^*(s) = \arg\max_a Q^*(s,a)\) — no model needed to act. This is the bridge from DP to Q-learning.

The Fixed-Point View

The Bellman equations are fixed-point equations. A fixed point of an operator \(T\) is a point \(V^*\) such that \(TV^* = V^*\). This seemingly circular definition is useful because:

  1. The Bellman operator is a contraction (under the sup-norm with factor \(\gamma\))
  2. By the Banach fixed-point theorem, contractions on complete metric spaces have a unique fixed point
  3. Repeated application of the contraction converges to this unique fixed point

This is exactly what Value Iteration does.
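A one-line numerical check of this convergence, using the scalar contraction \(T(v) = r + \gamma v\) (the values of \(r\) and \(\gamma\) below are chosen arbitrarily for illustration):

```python
gamma, r = 0.9, 1.0
v_star = r / (1 - gamma)          # the unique fixed point: T(v*) = v*
v = 0.0                           # any starting point works
errors = []
for _ in range(20):
    v = r + gamma * v             # one application of the contraction T
    errors.append(abs(v - v_star))

# consecutive errors shrink by exactly the contraction factor gamma
ratios = [e2 / e1 for e1, e2 in zip(errors, errors[1:])]
```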


§ 4

Contraction Mappings

The theoretical reason DP algorithms converge is that the Bellman operators are contractions.

Definition: Contraction Mapping

An operator \(T: X \to X\) on a metric space \((X, d)\) is a contraction with factor \(\gamma \in [0,1)\) if: \(\quad d(Tx, Ty) \leq \gamma \cdot d(x, y) \quad \forall x,y \in X\)

Bellman Optimality Operator is a Contraction
\[ \|\mathcal{T}^* V - \mathcal{T}^* U\|_\infty \leq \gamma \|V - U\|_\infty \quad \forall V, U \]
📐 Proof that Bellman Operator is a Contraction

For any two value functions \(V, U\):

\[\begin{aligned} (\mathcal{T}^* V)(s) - (\mathcal{T}^* U)(s) &= \max_a \mathbb{E}[r + \gamma V(s')] - \max_a \mathbb{E}[r + \gamma U(s')] \\ &\leq \max_a \mathbb{E}[r + \gamma V(s') - r - \gamma U(s')] \\ &= \gamma \max_a \mathbb{E}[V(s') - U(s')] \\ &\leq \gamma \|V - U\|_\infty \end{aligned}\]

By symmetry, \((\mathcal{T}^* U)(s) - (\mathcal{T}^* V)(s) \leq \gamma \|V-U\|_\infty\), so \(|(\mathcal{T}^*V)(s) - (\mathcal{T}^*U)(s)| \leq \gamma\|V-U\|_\infty\). Taking sup over \(s\) gives the result. □

By the Banach Fixed-Point Theorem: any contraction on a complete metric space has a unique fixed point, and iteration from any starting point converges to it at a geometric rate \(\gamma^k\).

🔄 Contraction Mapping Convergence

Watch repeated Bellman updates shrink the error by factor γ each iteration.


§ 5

Value Iteration (Algorithm 1)

Value Iteration (VI) starts from any initial guess \(V_0\) and repeatedly applies the Bellman optimality operator:

Value Iteration Update
\[ V_{k+1}(s) = \max_{a \in \mathcal{A}} \sum_{s',r} p(s',r|s,a)\left[r + \gamma V_k(s')\right] \quad \forall s \in \mathcal{S} \]

Because the Bellman optimality operator is a contraction with factor \(\gamma\), we have \(\|V_k - V^*\|_\infty \leq \gamma^k \|V_0 - V^*\|_\infty\), so the error shrinks geometrically.

Convergence Theorem

Theorem

For any \(V_0 \in \mathbb{R}^{|\mathcal{S}|}\), the Value Iteration sequence \(\{V_k\}\) satisfies: \(\|V_k - V^*\|_\infty \leq \frac{\gamma^k}{1-\gamma}\|V_1 - V_0\|_\infty\).

📐 Stopping Criterion Derivation

We want the returned value function to satisfy \(\|V - V^*\|_\infty \leq \epsilon\). Using the contraction property:

\[\|V_{k+1} - V_k\|_\infty < \delta \implies \|V_{k+1} - V^*\|_\infty \leq \frac{\gamma}{1-\gamma}\,\delta\]

So stopping when \(\|V_{k+1}-V_k\|_\infty < \epsilon(1-\gamma)/\gamma\) guarantees that \(V_{k+1}\) has error at most \(\epsilon\).

The Algorithm

Algorithm: Value Iteration
// Initialize (any values work; terminal states must be 0)
Initialize \(V(s)\) arbitrarily (e.g., \(0\)) for all \(s \in \mathcal{S}\), with \(V(\text{terminal}) = 0\)

repeat:
  \(\Delta \leftarrow 0\)
  for each \(s \in \mathcal{S}\):
    \(v \leftarrow V(s)\)
    \(V(s) \leftarrow \max_a \sum_{s',r} p(s',r|s,a)\left[r + \gamma V(s')\right]\)  // Bellman update
    \(\Delta \leftarrow \max(\Delta, |v - V(s)|)\)
until \(\Delta < \theta\)  // θ is small positive threshold

Output policy: \(\pi^*(s) = \arg\max_a \sum_{s',r} p(s',r|s,a)\left[r + \gamma V(s')\right]\)
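The pseudocode above translates almost line-for-line into NumPy. A sketch, assuming transitions are given as a tensor `P[a, s, s']` with expected rewards `R[a, s]` (a common but not universal convention), and a toy two-state MDP for illustration:

```python
import numpy as np

def value_iteration(P, R, gamma=0.9, theta=1e-8):
    """P[a, s, s']: transition probabilities; R[a, s]: expected rewards."""
    n_actions, n_states, _ = P.shape
    V = np.zeros(n_states)
    while True:
        Q = R + gamma * (P @ V)          # one Bellman backup per (a, s) pair
        V_new = Q.max(axis=0)            # max over actions (Bellman optimality)
        delta = np.max(np.abs(V_new - V))
        V = V_new
        if delta < theta:
            return V, Q.argmax(axis=0)   # greedy policy extraction

# Toy MDP (assumed): in state 0, action 1 pays +1 and moves to
# absorbing state 1; all other transitions pay 0.
P = np.array([[[1.0, 0.0],    # action 0: stay
               [0.0, 1.0]],
              [[0.0, 1.0],    # action 1: jump to state 1
               [0.0, 1.0]]])
R = np.array([[0.0, 0.0],
              [1.0, 0.0]])
V, pi = value_iteration(P, R)
print(V, pi)   # V ≈ [1, 0]; greedy action in state 0 is 1
```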

✅ Advantages

• Guaranteed convergence
• Simple to implement
• Online: extract policy at any point
• No explicit policy representation needed

⚠️ Disadvantages

• Slow convergence (many iterations)
• Each sweep visits all states
• Intermediate \(V_k\) may not be the value function of any policy
• Requires full model knowledge


§ 6

Policy Iteration (Algorithm 2)

Policy Iteration alternates between two phases: policy evaluation (computing \(V^\pi\) for the current policy) and policy improvement (updating the policy greedily).

Phase 1: Policy Evaluation

Iterative Policy Evaluation
\[ V_{k+1}(s) \leftarrow \sum_a \pi(a|s) \sum_{s',r} p(s',r|s,a)\left[r + \gamma V_k(s')\right] \quad \forall s \]

Repeat until convergence to obtain \(V^\pi\). This is applying the Bellman expectation operator repeatedly — also a contraction, so it converges.

Phase 2: Policy Improvement

Greedy Policy Improvement
\[ \pi'(s) = \arg\max_a \sum_{s',r} p(s',r|s,a)\left[r + \gamma V^\pi(s')\right] \]
Policy Improvement Theorem

If \(\pi'\) is obtained by greedy improvement over \(V^\pi\), then \(V^{\pi'}(s) \geq V^\pi(s)\) for all \(s\). If no improvement is possible, the policy is already optimal: \(\pi = \pi^*\).

📐 Proof of Policy Improvement Theorem
\[\begin{aligned} V^\pi(s) &\leq Q^\pi(s, \pi'(s)) = \mathbb{E}[R_{t+1} + \gamma V^\pi(S_{t+1}) | S_t=s, A_t=\pi'(s)]\\ &\leq \mathbb{E}[R_{t+1} + \gamma Q^\pi(S_{t+1}, \pi'(S_{t+1})) | \ldots]\\ &\leq \mathbb{E}\left[\sum_{k=0}^\infty \gamma^k R_{t+k+1} | S_t=s, \pi'\right] = V^{\pi'}(s) \end{aligned}\]

The key step: \(V^\pi(s) \leq Q^\pi(s,\pi'(s))\) follows because \(\pi'\) is greedy over \(Q^\pi\), so \(Q^\pi(s,\pi'(s)) = \max_a Q^\pi(s,a) \geq \sum_a \pi(a|s) Q^\pi(s,a) = V^\pi(s)\).

Full Algorithm

Algorithm: Policy Iteration
Initialize \(\pi\) arbitrarily, \(V(s) = 0\) for all \(s\)

repeat:
  // Policy Evaluation
  repeat:
    for each \(s \in \mathcal{S}\):
      \(V(s) \leftarrow \sum_a \pi(a|s)\sum_{s',r}p(s',r|s,a)[r+\gamma V(s')]\)
  until convergence

  // Policy Improvement
  policy-stable \(\leftarrow\) true
  for each \(s \in \mathcal{S}\):
    old \(\leftarrow \pi(s)\)
    \(\pi(s) \leftarrow \arg\max_a \sum_{s',r}p(s',r|s,a)[r+\gamma V(s')]\)
    if old \(\neq \pi(s)\): policy-stable \(\leftarrow\) false
until policy-stable
Return \(\pi, V\)
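A compact NumPy sketch of the loop above, using an exact linear solve for the evaluation phase (the `P[a, s, s']` / `R[a, s]` layout and the toy MDP are assumptions for illustration):

```python
import numpy as np

def policy_iteration(P, R, gamma=0.9):
    """P[a, s, s']: transitions; R[a, s]: expected rewards."""
    n_actions, n_states, _ = P.shape
    pi = np.zeros(n_states, dtype=int)           # arbitrary initial policy
    idx = np.arange(n_states)
    while True:
        # Policy evaluation: solve (I - gamma P_pi) V = r_pi exactly
        P_pi, r_pi = P[pi, idx], R[pi, idx]
        V = np.linalg.solve(np.eye(n_states) - gamma * P_pi, r_pi)
        # Policy improvement: act greedily with respect to Q^pi
        pi_new = (R + gamma * (P @ V)).argmax(axis=0)
        if np.array_equal(pi_new, pi):           # policy-stable: done
            return pi, V
        pi = pi_new

# Toy MDP (assumed): in state 0, action 1 pays +1 and absorbs into state 1
P = np.array([[[1.0, 0.0], [0.0, 1.0]],   # action 0: stay
              [[0.0, 1.0], [0.0, 1.0]]])  # action 1: jump to state 1
R = np.array([[0.0, 0.0], [1.0, 0.0]])
pi, V = policy_iteration(P, R)
```

Because `argmax` breaks ties deterministically, the loop cannot oscillate between equally good policies and terminates in finitely many improvement steps.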
Finite Convergence

Policy Iteration converges in a finite number of iterations: since there are finitely many deterministic policies (\(|\mathcal{A}|^{|\mathcal{S}|}\)) and the policy strictly improves at each step (until optimal), it must terminate.


§ 7

Value Iteration vs. Policy Iteration

| Aspect | Value Iteration | Policy Iteration |
|---|---|---|
| Core operation | Apply Bellman optimality op. | Evaluate, then improve policy |
| Policy evaluation | One Bellman sweep per iteration | Run to full convergence |
| Iterations to convergence | More (geometric rate) | Fewer (finite, often fast) |
| Cost per iteration | Low (\(O(|\mathcal{S}|^2|\mathcal{A}|)\) per sweep) | High (evaluation cost + improvement) |
| Memory | Store \(V\) only | Store \(V\) and \(\pi\) |
| Intermediate policy? | Only at end | Yes, at each iteration |
| Better for | Large state spaces | When evaluation is cheap |
Practical Insight

In practice, truncated Policy Iteration (running evaluation for a fixed number of steps, not to full convergence) often works better than either extreme. When truncation = 1 sweep, it is exactly Value Iteration.
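A sketch of that middle ground (same assumed `P[a, s, s']` / `R[a, s]` layout as elsewhere; `m` controls the truncation):

```python
import numpy as np

def truncated_policy_iteration(P, R, gamma=0.9, m=3, n_iters=200):
    """Modified policy iteration: m evaluation sweeps per greedy improvement.
    m = 1 recovers Value Iteration; m -> infinity recovers Policy Iteration."""
    n_actions, n_states, _ = P.shape
    V = np.zeros(n_states)
    idx = np.arange(n_states)
    for _ in range(n_iters):
        pi = (R + gamma * (P @ V)).argmax(axis=0)   # greedy improvement
        for _ in range(m):                          # truncated evaluation
            V = R[pi, idx] + gamma * (P[pi, idx] @ V)
    return V, pi

# Toy MDP (assumed): action 1 in state 0 pays +1 and absorbs into state 1
P = np.array([[[1.0, 0.0], [0.0, 1.0]],
              [[0.0, 1.0], [0.0, 1.0]]])
R = np.array([[0.0, 0.0], [1.0, 0.0]])
V, pi = truncated_policy_iteration(P, R)
```

With `m=1` the evaluation sweep applies the greedy policy's backup, which equals the \(\max\) backup of Value Iteration, making the equivalence concrete.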


§ 8

Generalized Policy Iteration

GPI is the overarching framework encompassing all DP (and most RL) algorithms. The key observation: both evaluation and improvement can be applied partially — the essential property is that they interact in a way that drives the pair \((V, \pi)\) toward \((V^*, \pi^*)\).

♾️ GPI: The Unifying Framework

The two processes — evaluation and improvement — compete and cooperate, converging to the optimal pair.

In GPI, we never need to run either process to full completion. Even Monte Carlo methods with partial returns, and TD learning with one-step bootstrapping, are instances of GPI. This is the bridge from DP to model-free RL.


§ 9

Finite vs. Infinite Horizon

Finite Horizon

Episodes terminate at time \(T\). Value depends on time remaining. Solved by backward induction: start at \(V_T = 0\) and work backwards.

\(V_t(s) = \max_a \mathbb{E}[R_{t+1} + V_{t+1}(S_{t+1}) \mid S_t = s, A_t = a]\)

Exact solution, but policy is time-dependent.

Infinite Horizon (Discounted)

No terminal state. Discount \(\gamma < 1\) ensures finite total reward. Value is stationary (no time index).

\(V^*(s) = \max_a \mathbb{E}[R + \gamma V^*(S')]\)

Solved by VI or PI. Policy is stationary.
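Backward induction for the finite-horizon case can be sketched as follows (same assumed `P[a, s, s']` / `R[a, s]` layout as in the other sketches; undiscounted, matching the finite-horizon equation):

```python
import numpy as np

def backward_induction(P, R, T):
    """Finite-horizon DP: set V[T] = 0, then sweep backwards in time."""
    n_actions, n_states, _ = P.shape
    V = np.zeros((T + 1, n_states))       # V[t, s]; V[T] = 0 by convention
    pi = np.zeros((T, n_states), dtype=int)
    for t in range(T - 1, -1, -1):
        Q = R + P @ V[t + 1]              # undiscounted one-step lookahead
        V[t] = Q.max(axis=0)
        pi[t] = Q.argmax(axis=0)          # note: the policy depends on t
    return V, pi

# Toy MDP (assumed): action 1 in state 0 pays +1 and absorbs into state 1
P = np.array([[[1.0, 0.0], [0.0, 1.0]],   # action 0: stay
              [[0.0, 1.0], [0.0, 1.0]]])  # action 1: collect +1, absorb
R = np.array([[0.0, 0.0], [1.0, 0.0]])
V, pi = backward_induction(P, R, T=3)
```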


§ 10

Interactive Gridworld Simulation (Hands-On)

Experiment with Value Iteration and Policy Iteration on a 5×5 gridworld. The agent (🤖) starts anywhere and must reach the goal (⭐). Walls block movement; pits give negative reward.

🗺️ Gridworld — Value & Policy Visualization
Click cell to toggle wall · Shift+click = pit
Value Function V(s)
Heat map: darker red = higher value
🤖 Agent start   ⭐ Goal (+10)   🧱 Wall   ☠️ Pit (−5)   Arrows = greedy policy

§ 11

Theoretical Insights

Convergence Rates

Both VI and PI converge, but at different rates:

| Algorithm | Convergence type | Rate |
|---|---|---|
| Value Iteration | Geometric (linear) | \(\|V_k - V^*\|_\infty \leq \gamma^k \|V_0 - V^*\|_\infty\) |
| Policy Iteration | Finite steps | At most \(|\mathcal{A}|^{|\mathcal{S}|}\) iterations (often much fewer) |
| Policy Iteration (near the optimum) | Superlinear | Newton's method analogy for the Bellman operator |

Uniqueness of Fixed Point

Theorem: Unique Optimal Value Function

The Bellman optimality equation \(V = \mathcal{T}^* V\) has a unique solution \(V^*\), and \(V^*(s) = \max_\pi V^\pi(s)\) for all \(s\). There always exists a deterministic stationary optimal policy \(\pi^*\).

📐 Why Deterministic Policies Suffice

For any stochastic policy \(\pi\) and any state \(s\), let \(\pi_{\text{det}}(s) = \arg\max_a Q^\pi(s,a)\). Then:

\[V^\pi(s) = \sum_a \pi(a|s) Q^\pi(s,a) \leq \max_a Q^\pi(s,a) = Q^\pi(s, \pi_{\text{det}}(s))\]

The maximum over actions is at least as large as any convex combination, so the greedy deterministic policy is at least as good as any stochastic policy; applying policy improvement repeatedly yields a deterministic stationary optimal policy.

Policy Iteration and Newton's Method

There is a deep analogy: Policy Iteration is Newton's method applied to the Bellman optimality equation. This explains its superlinear local convergence: once near the optimum, it converges very quickly. Value Iteration, by contrast, is plain fixed-point iteration (successive approximation), converging geometrically at rate \(\gamma\) per step.


§ 12

Limitations of Dynamic Programming

🌊 Curse of Dimensionality

The state space \(\mathcal{S}\) grows exponentially with the number of state variables. A robot with 10 joints, each with 100 positions, has \(100^{10} = 10^{20}\) states. Storing and updating \(V\) for all states is computationally infeasible.

🗺️ Requires Full Model

DP needs the transition kernel \(p(s',r|s,a)\) exactly. In most real applications (games, robotics, finance), this is unknown. We can only observe samples from the environment.

⏱️ Synchronous Updates

Standard DP updates all states in each sweep. For large state spaces, this is expensive. Asynchronous DP (updating states in arbitrary order) can help but requires care.

🧠 No Generalization

Tabular DP stores one value per state. It cannot generalize across similar states. Function approximation (neural networks) is needed for large/continuous spaces — but this sacrifices convergence guarantees.

📈 Curse of Dimensionality

State space size grows exponentially with number of state dimensions (assuming 10 values per dim).


§ 13

Bridge to Model-Free RL

The limitations of DP are precisely the motivations for model-free methods. Here is how DP ideas translate:

🌉 From DP to Model-Free RL
| DP concept | Model-free analog | Key difference |
|---|---|---|
| Policy evaluation (exact) | MC / TD prediction | Uses samples instead of the model |
| Value Iteration | Q-learning | Sampled Bellman updates, no model |
| Policy Iteration | Actor-Critic | Stochastic approximation for both phases |
| Bellman operator \(\mathcal{T}^*\) | Sample backup in TD | A single sample \((s,a,r,s')\) is used |
| Synchronous DP sweeps | Experience replay in DQN | Replay buffer ≈ sweeping with samples |
The Fundamental Insight

Model-free RL replaces the expectation in the Bellman update with a sample. Instead of \(\sum_{s',r} p(s',r|s,a)[\ldots]\), we use a single observed \((r, s')\). By stochastic approximation theory, the iterates still converge — but now without any model knowledge.

Q-Learning as Sampled Value Iteration

Value Iteration → Q-Learning
\[ \underbrace{V_{k+1}(s) = \max_a \sum_{s',r} p(s',r|s,a)[r+\gamma V_k(s')]}_{\text{DP: full expectation}} \;\longrightarrow\; \underbrace{Q(s,a) \mathrel{+}= \alpha\left[r + \gamma\max_{a'}Q(s',a') - Q(s,a)\right]}_{\text{Q-learning: single sample}} \]
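A minimal tabular sketch of the right-hand side. The toy deterministic environment and pure random exploration below are assumptions for illustration; practical Q-learning would use, e.g., ε-greedy exploration and a decaying step size:

```python
import random
random.seed(0)

gamma, alpha = 0.9, 0.5
Q = {(s, a): 0.0 for s in (0, 1) for a in (0, 1)}

def step(s, a):
    """Assumed toy dynamics: action 1 in state 0 pays +1 and terminates."""
    if s == 0 and a == 1:
        return 1.0, 1, True          # reward, next state, done
    return 0.0, 0, False             # action 0 stays in state 0

for _ in range(200):                 # episodes
    s, done = 0, False
    while not done:
        a = random.choice((0, 1))    # explore uniformly at random
        r, s2, done = step(s, a)
        target = r if done else r + gamma * max(Q[(s2, 0)], Q[(s2, 1)])
        Q[(s, a)] += alpha * (target - Q[(s, a)])   # sampled Bellman update
        s = s2

# Q(0, 1) converges to 1 (immediate terminal reward);
# Q(0, 0) converges to gamma * 1 = 0.9 (stay, then collect later)
```

No step-size decay is needed here only because the toy dynamics are deterministic; with stochastic transitions, a diminishing \(\alpha\) is required for convergence.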

§ 14

Summary

The Big Picture

Dynamic Programming solves MDPs by exploiting the recursive structure of the Bellman equations. The Bellman operator is a contraction, guaranteeing convergence. Value Iteration applies this operator to find \(V^*\); Policy Iteration alternates evaluation and improvement to converge in finite steps. Both are forms of Generalized Policy Iteration — the unifying framework for all of RL.

| Concept | One-line summary |
|---|---|
| Bellman expectation | \(V^\pi = \mathcal{T}^\pi V^\pi\): fixed-point equation for policy \(\pi\) |
| Bellman optimality | \(V^* = \mathcal{T}^* V^*\): nonlinear fixed point for the optimal value |
| Contraction property | Bellman operators shrink sup-norm distances by \(\gamma\), ensuring a unique fixed point |
| Value Iteration | Repeatedly apply \(\mathcal{T}^*\); converges geometrically |
| Policy Iteration | Evaluate, then improve; converges in finitely many steps |
| GPI | Any interleaving of evaluation and improvement converges |
| Curse of dimensionality | DP is exact but exponential in the number of state variables |
| Bridge to model-free | Replace expectations with samples → TD, Q-learning, Actor-Critic |
Essential Reading

• Sutton & Barto (2018). Reinforcement Learning: An Introduction. Ch. 4.
• Bertsekas (2012). Dynamic Programming and Optimal Control. Vol. 1 & 2.
• Puterman (1994). Markov Decision Processes.
• Bellman, R. (1957). Dynamic Programming. Princeton University Press.
