A rigorous & intuitive treatment — from discrete-time dynamic programming to the continuous-time HJB equation — for the graduate RL researcher.
1. Motivation & Intuition
Imagine a chess player evaluating board positions. They don't compute the outcome of every possible game to the end — instead, they develop an intuition about how "good" each position is. This intuition is exactly the value function: a mapping from states (and optionally actions) to expected cumulative reward, which encodes how desirable a situation is under a given policy.
The central idea of dynamic programming is that optimal sequential decisions can be decomposed into a current decision plus the optimal value of the future. This is Bellman's principle of optimality.
The Bellman equation makes this recursive: the value of being in a state equals the immediate reward you collect plus (discounted) value of the next state. This recursion is the engine of almost all model-based RL algorithms.
The Hamilton–Jacobi–Bellman equation (the continuous-time PDE analogue of the Bellman optimality equation, characterizing the optimal value function as the solution of a partial differential equation involving the system dynamics) is what Bellman's equation becomes when time is continuous and the state evolves according to a differential equation — the natural setting for control theory and physics-based problems.
We operate in a Markov Decision Process (MDP): a tuple $(\mathcal{S}, \mathcal{A}, P, r, \gamma)$ where $\mathcal{S}$ is the state space, $\mathcal{A}$ the action space, $P(s'|s,a)$ the transition kernel, $r(s,a)$ the reward function, and $\gamma \in [0,1)$ the discount factor.
An agent follows a policy $\pi: \mathcal{S} \to \Delta(\mathcal{A})$ (stationary, possibly stochastic). The goal is to find $\pi^*$ that maximizes the expected discounted return:
$$\pi^* = \arg\max_\pi \; \mathbb{E}_\pi\!\left[\sum_{t=0}^{\infty} \gamma^t\, r(s_t, a_t)\right]$$
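As a concrete sketch of the quantity being maximized, the discounted return is just a geometrically weighted sum of rewards along a trajectory (the reward numbers below are illustrative, not from the text):

```python
# Discounted return G = sum_t gamma^t * r_t for one illustrative trajectory.
gamma = 0.9
rewards = [1.0, 0.0, 2.0]  # r_0, r_1, r_2 (assumed numbers)
G = sum(gamma**t * r for t, r in enumerate(rewards))
print(G)  # 1 + 0 + 2 * 0.81 = 2.62
```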
The key insight is that this infinite-horizon problem has recursive structure — the basis for Bellman's work and subsequently the entire edifice of DP and RL.
2. Value Functions
2.1 State-Value Function $V^\pi(s)$
$V^\pi(s)$ answers: "If I'm in state $s$ and follow policy $\pi$ forever, how much total reward can I expect to accumulate?" Formally,
$$V^\pi(s) = \mathbb{E}_\pi\!\left[\sum_{t=0}^{\infty} \gamma^t\, r(s_t, a_t) \;\middle|\; s_0 = s\right]$$
Think of it as the "price" of being in a state under a given behavioral strategy. Higher $V^\pi(s)$ = better state.
The expectation is over the stochasticity in both the policy $\pi(\cdot|s)$ and the transition kernel $P(\cdot|s,a)$.
2.2 Action-Value (Q) Function $Q^\pi(s,a)$
$Q^\pi(s,a)$ answers: "If I'm in state $s$, take action $a$ right now (even if $\pi$ wouldn't normally suggest it), then follow $\pi$ forever after — how much reward do I get?" Formally,
$$Q^\pi(s,a) = \mathbb{E}_\pi\!\left[\sum_{t=0}^{\infty} \gamma^t\, r(s_t, a_t) \;\middle|\; s_0 = s,\; a_0 = a\right]$$
The Q-function is more informative than V — it tells you the value of each action, enabling direct policy improvement.
The two functions are related by:
$$V^\pi(s) = \sum_{a} \pi(a|s) \, Q^\pi(s,a) = \mathbb{E}_{a \sim \pi(\cdot|s)}[Q^\pi(s,a)]$$
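The identity $V^\pi(s) = \mathbb{E}_{a\sim\pi(\cdot|s)}[Q^\pi(s,a)]$ can be checked on a toy example (all numbers are illustrative, not from the text):

```python
import numpy as np

# Illustrative 2-state, 2-action example.
Q = np.array([[1.0, 3.0],    # Q(s0, a0), Q(s0, a1)
              [2.0, 0.0]])   # Q(s1, a0), Q(s1, a1)
pi = np.array([[0.5, 0.5],   # pi(. | s0)
               [0.9, 0.1]])  # pi(. | s1)

# V^pi(s) = sum_a pi(a|s) Q^pi(s, a): a policy-weighted average per state.
V = (pi * Q).sum(axis=1)
print(V)  # [2.  1.8]
```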
3. Bellman Expectation Equations
The Bellman equations give recursive characterizations of value functions. They are the central structural property enabling computation of $V^\pi$ and $Q^\pi$.
3.1 Bellman Equation for $V^\pi$
The value of a state is the immediate reward you expect to get (by averaging over actions and next states) plus the discounted value of wherever you end up next. The value decomposes into now and everything after now.
Starting from the definition and splitting off the first reward term:
$$V^\pi(s) = \sum_a \pi(a|s)\left[r(s,a) + \gamma \sum_{s'} P(s'|s,a)\, V^\pi(s')\right]$$
Equivalently, in operator form: $V^\pi = \mathcal{T}^\pi V^\pi$, where the Bellman operator $\mathcal{T}^\pi$ maps any function $f: \mathcal{S} \to \mathbb{R}$ to $(\mathcal{T}^\pi f)(s) = \sum_a \pi(a|s)[r(s,a) + \gamma \sum_{s'} P(s'|s,a)f(s')]$. $\mathcal{T}^\pi$ is a $\gamma$-contraction in $\ell^\infty$, so repeated application converges to its unique fixed point $V^\pi$.
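The contraction property can be observed numerically. The sketch below builds a randomly generated MDP (illustrative, not from the text), computes $V^\pi$ exactly from the linear system, and checks that each application of $\mathcal{T}^\pi$ shrinks the sup-norm error by at least a factor of $\gamma$:

```python
import numpy as np

rng = np.random.default_rng(0)
nS, nA, gamma = 4, 2, 0.9

# Random MDP (illustrative): transition kernel P[s, a, s'] and reward r[s, a].
P = rng.random((nS, nA, nS))
P /= P.sum(axis=2, keepdims=True)
r = rng.random((nS, nA))
pi = np.full((nS, nA), 1.0 / nA)   # uniform random policy

def T_pi(V):
    """(T^pi V)(s) = sum_a pi(a|s) [r(s,a) + gamma * sum_s' P(s'|s,a) V(s')]"""
    return (pi * (r + gamma * (P @ V))).sum(axis=1)

# Exact V^pi from the linear system (I - gamma P^pi) V^pi = r^pi.
P_pi = np.einsum('sa,sat->st', pi, P)   # P^pi[s, s'] = sum_a pi(a|s) P(s'|s,a)
r_pi = (pi * r).sum(axis=1)
V_exact = np.linalg.solve(np.eye(nS) - gamma * P_pi, r_pi)

V = np.zeros(nS)
for _ in range(20):
    err_before = np.abs(V - V_exact).max()
    V = T_pi(V)
    err_after = np.abs(V - V_exact).max()
    assert err_after <= gamma * err_before + 1e-12   # gamma-contraction in sup norm
```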
3.2 Bellman Equation for $Q^\pi$
The Q-function's Bellman equation shows: take action $a$, observe reward $r(s,a)$, land in $s'$, then follow $\pi$ (averaging over $\pi$'s choice of $a'$ in $s'$):
$$Q^\pi(s,a) = r(s,a) + \gamma \sum_{s'} P(s'|s,a) \sum_{a'} \pi(a'|s')\, Q^\pi(s',a')$$
$V^\pi$ and $Q^\pi$ are interchangeable: $V^\pi(s) = \mathbb{E}_{a\sim\pi}[Q^\pi(s,a)]$ and $Q^\pi(s,a) = r(s,a) + \gamma \mathbb{E}_{s'}[V^\pi(s')]$. This symmetry drives policy gradient theorems.
4. Bellman Optimality Equations
What if we don't commit to any fixed policy, but instead ask: what is the best possible value achievable from each state? This is the optimal value function $V^*(s) = \max_\pi V^\pi(s)$, the envelope over all policies; similarly, $Q^*(s,a) = \max_\pi Q^\pi(s,a)$. $V^*$ is the ceiling: no policy can do better.
The Bellman optimality equation says: at each state, you get to pick the best action, then nature picks the next state.
Define $V^*(s) = \max_\pi V^\pi(s)$ and $Q^*(s,a) = \max_\pi Q^\pi(s,a)$. By Bellman's principle of optimality:
$$V^*(s) = \max_a\left[r(s,a) + \gamma \sum_{s'} P(s'|s,a)\, V^*(s')\right], \qquad Q^*(s,a) = r(s,a) + \gamma \sum_{s'} P(s'|s,a)\, \max_{a'} Q^*(s',a')$$
Unlike the Bellman expectation equations (which are linear in $V^\pi$), the optimality equations are nonlinear due to the $\max$ operator. This nonlinearity makes them harder to solve analytically, but the optimality operator is still a contraction, so $V^*$ remains its unique fixed point.
The optimal policy is recovered greedily: $\pi^*(s) = \arg\max_a Q^*(s,a)$. The Bellman optimality operator $\mathcal{T}^*$ defined by $(\mathcal{T}^* V)(s) = \max_a[r(s,a) + \gamma \sum_{s'} P(s'|s,a)V(s')]$ is also a $\gamma$-contraction in $\ell^\infty(\mathcal{S})$.
5. The Bellman Backup
The Bellman backup propagates value from successor states back to the current state: each application of the recursive equation pulls one more step of future reward into the current estimate.
6. Bellman → Dynamic Programming
The Bellman equations are not just structural — they compute. The fixed-point interpretation gives us two canonical algorithms.
6.1 Value Iteration
Start with any value function (e.g., all zeros). Repeatedly apply the Bellman optimality operator. Since $\mathcal{T}^*$ is a contraction, the sequence converges to $V^*$.
Repeat until convergence ($\|V_{k+1} - V_k\|_\infty < \epsilon$):
$$V_{k+1}(s) = \max_a\left[r(s,a) + \gamma \sum_{s'} P(s'|s,a)\, V_k(s')\right]$$
Convergence rate: after $k$ iterations, $\|V_k - V^*\|_\infty \le \gamma^k \|V_0 - V^*\|_\infty$. This is geometric.
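A minimal sketch of value iteration, applied to a toy 3-state chain MDP (the MDP is an assumed example, not from the text):

```python
import numpy as np

def value_iteration(P, r, gamma, tol=1e-8):
    """Repeatedly apply the Bellman optimality operator T*.
    P has shape (S, A, S); r has shape (S, A)."""
    V = np.zeros(P.shape[0])
    while True:
        Q = r + gamma * (P @ V)        # one-step lookahead Q-values
        V_new = Q.max(axis=1)          # (T* V)(s) = max_a Q(s, a)
        if np.abs(V_new - V).max() < tol:
            return V_new, Q.argmax(axis=1)  # value and greedy policy
        V = V_new

# Toy 3-state chain (assumed example): action 0 = "left", action 1 = "right";
# reward 1 for stepping from state 1 into the absorbing state 2.
P = np.zeros((3, 2, 3))
P[0, 0, 0] = P[0, 1, 1] = 1.0
P[1, 0, 0] = P[1, 1, 2] = 1.0
P[2, 0, 2] = P[2, 1, 2] = 1.0
r = np.zeros((3, 2))
r[1, 1] = 1.0
V, greedy = value_iteration(P, r, gamma=0.9)
print(V)       # approx [0.9, 1.0, 0.0]
print(greedy)  # [1 1 0] -- always move right toward the reward
```

Note that the greedy policy is recovered for free from the final lookahead, exactly as in $\pi^*(s) = \arg\max_a Q^*(s,a)$.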
6.2 Policy Iteration
Start with any policy $\pi_0$. Alternate between: (1) evaluating the current policy exactly (solve the Bellman expectation equation), and (2) improving by acting greedily w.r.t. the computed values. Each step is guaranteed not to decrease performance, and the process terminates at $\pi^*$ in finite steps (for finite MDPs).
Policy Evaluation: Solve $(I - \gamma P^{\pi_k}) V^{\pi_k} = r^{\pi_k}$ — a linear system.
Policy Improvement: $\pi_{k+1}(s) = \arg\max_a [r(s,a) + \gamma \sum_{s'} P(s'|s,a) V^{\pi_k}(s')]$.
Policy improvement theorem: $V^{\pi_{k+1}} \ge V^{\pi_k}$ pointwise. Since the policy space is finite, termination is guaranteed.
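The evaluate-then-improve loop can be sketched in a few lines; evaluation is the exact linear solve described above, and the toy chain MDP used as a usage check is an assumed example:

```python
import numpy as np

def policy_iteration(P, r, gamma):
    """Alternate exact policy evaluation (linear solve) with greedy improvement."""
    nS = P.shape[0]
    pi = np.zeros(nS, dtype=int)                  # arbitrary initial deterministic policy
    while True:
        P_pi = P[np.arange(nS), pi]               # (S, S) transition kernel under pi
        r_pi = r[np.arange(nS), pi]
        V = np.linalg.solve(np.eye(nS) - gamma * P_pi, r_pi)   # evaluation
        pi_new = (r + gamma * (P @ V)).argmax(axis=1)          # greedy improvement
        if np.array_equal(pi_new, pi):
            return V, pi                          # greedy w.r.t. own value: optimal
        pi = pi_new

# Toy 3-state chain (assumed example): reward 1 for stepping from state 1
# into the absorbing state 2.
P = np.zeros((3, 2, 3))
P[0, 0, 0] = P[0, 1, 1] = 1.0
P[1, 0, 0] = P[1, 1, 2] = 1.0
P[2, 0, 2] = P[2, 1, 2] = 1.0
r = np.zeros((3, 2))
r[1, 1] = 1.0
V, pi = policy_iteration(P, r, gamma=0.9)
print(V, pi)  # approx [0.9, 1.0, 0.0], policy [1 1 0]
```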
Both algorithms find fixed points of Bellman operators:
- $V^\pi$ is the unique fixed point of $\mathcal{T}^\pi$ (Bellman expectation operator)
- $V^*$ is the unique fixed point of $\mathcal{T}^*$ (Bellman optimality operator)
- Contraction mapping theorem (Banach) guarantees existence, uniqueness, and convergence
6.3 Generalized Policy Iteration
Modern RL algorithms (Actor-Critic, PPO, SAC) can be viewed as approximate generalized policy iteration: they interleave partial evaluation (TD learning, Monte Carlo) with partial improvement (gradient-based policy updates). The Bellman equation remains the core recursive structure throughout.
7. Why Continuous-Time Control?
The discrete MDP framework is elegant but has limitations in physical systems. Real robots, aircraft, financial markets, and biological systems evolve continuously — their state changes according to differential equations, not jump processes.
- State dynamics are naturally modeled as SDEs: $dX_t = f(X_t, u_t)dt + \sigma(X_t)dW_t$
- Discrete-time with small $\Delta t$ only approximates this; errors accumulate
- Continuous-time gives exact, closed-form optimality conditions
- PDE theory (viscosity solutions) handles non-smooth value functions rigorously
In continuous time, the state evolves as a controlled diffusion:
$$dX_t = f(X_t, u_t)\,dt + \sigma(X_t)\,dW_t$$
where $W_t$ is a standard Brownian motion and $u_t \in \mathcal{U}$ is the control (the analogue of an action). The objective is:
$$\max_{u} \; \mathbb{E}\!\left[\int_0^\infty e^{-\rho t}\, r(X_t, u_t)\, dt\right]$$
where $\rho > 0$ is the continuous-time discount rate, related to the discrete factor by $\gamma = e^{-\rho \Delta t}$ (so $\rho$ plays the role of $-\ln\gamma$ per unit time).
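Under the dynamics and objective above, a single rollout can be approximated by Euler–Maruyama discretization. The sketch below estimates the (truncated) discounted objective; the 1-D linear-quadratic example and the feedback policy $u = -x$ are assumptions for illustration, not from the text:

```python
import numpy as np

def simulate_return(x0, policy, f, sigma, r, rho, dt=1e-3, T=10.0, seed=0):
    """Euler--Maruyama rollout of dX = f(X,u) dt + sigma(X) dW, accumulating
    the discounted objective int_0^T e^{-rho t} r(X_t, u_t) dt (truncated at T)."""
    rng = np.random.default_rng(seed)
    x, t, G = x0, 0.0, 0.0
    while t < T:
        u = policy(x)
        G += np.exp(-rho * t) * r(x, u) * dt
        x = x + f(x, u) * dt + sigma(x) * np.sqrt(dt) * rng.standard_normal()
        t += dt
    return G

# Assumed 1-D example: drift f = u, quadratic cost, zero noise, and the
# linear feedback policy u = -x. Then X_t = x0 * e^{-t}, so the objective
# integral has the closed form -2/3 * (1 - e^{-3T}) for comparison.
G = simulate_return(x0=1.0, policy=lambda x: -x,
                    f=lambda x, u: u, sigma=lambda x: 0.0,
                    r=lambda x, u: -(x**2 + u**2), rho=1.0)
print(G)  # close to -2/3
```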
8. The Hamilton–Jacobi–Bellman Equation
The HJB equation is the PDE that the optimal value function $V^*$ must satisfy in continuous time. It says: at any state $x$, there is a "flow" that keeps $V^*$ consistent — the instantaneous reward plus the expected drift of value due to system dynamics must exactly balance the discount rate times $V^*$.
It is the continuous-time analogue of the Bellman optimality equation. While the Bellman equation is a system of equations (one per state), HJB is a single PDE governing the value function over the entire state space.
For the deterministic case ($\sigma = 0$), HJB is:
$$\rho V^*(x) = \max_{u \in \mathcal{U}}\left[r(x,u) + \nabla_x V^*(x) \cdot f(x,u)\right]$$
For the stochastic case (controlled diffusion), the infinitesimal generator $\mathcal{L}^u$ acts on $V$ as $(\mathcal{L}^u V)(x) = f(x,u) \cdot \nabla_x V(x) + \frac{1}{2}\text{tr}\big(\sigma(x)\sigma(x)^\top \nabla^2_x V(x)\big)$, and HJB becomes:
$$\rho V^*(x) = \sup_{u \in \mathcal{U}}\left[r(x,u) + (\mathcal{L}^u V^*)(x)\right]$$
Viscosity Solutions. $V^*$ need not be smooth (it may not be differentiable everywhere). The notion of viscosity solutions (Crandall–Lions) provides the correct framework for HJB when classical derivatives fail — the unique viscosity solution is $V^*$.
9. Derivation: Bellman → HJB
This is the core of the connection. We derive the HJB equation as the continuous-time limit $\Delta t \to 0$ of the discrete-time Bellman optimality equation. We work with the deterministic case for clarity; the stochastic case adds an Itô correction term.
- Deterministic dynamics: $\dot{x}(t) = f(x(t), u(t))$
- $V^*$ is sufficiently smooth (at least $C^1$ in $x$ and $t$, $C^2$ for stochastic)
- Discounting: the per-step factor is $\gamma_{\Delta t} = e^{-\rho \Delta t}$, so that the discrete discount matches the continuous-time rate $\rho$
- Reward accumulates as $\int_0^{\Delta t} e^{-\rho s} r(x(s), u(s)) ds \approx r(x,u)\Delta t + O(\Delta t^2)$
Start: Discrete-Time Bellman Optimality
With time step $\Delta t$, the Bellman optimality equation reads:
$$V^*(x) = \max_{u}\left[r(x,u)\,\Delta t + e^{-\rho \Delta t}\, V^*\big(x + f(x,u)\,\Delta t + O(\Delta t^2)\big)\right]$$
where we use $\gamma = e^{-\rho \Delta t}$ (so $\gamma \to 1$ as $\Delta t \to 0$, with $\rho$ fixed) and the next state is $x' = x + f(x,u)\Delta t + O(\Delta t^2)$ from Euler discretization.
Taylor Expand the Discount Factor
Expand $e^{-\rho \Delta t}$ for small $\Delta t$:
$$e^{-\rho \Delta t} = 1 - \rho\,\Delta t + O(\Delta t^2)$$
Taylor Expand $V^*$ at Next State
Expand $V^*(x + f(x,u)\Delta t)$ around $x$:
$$V^*(x + f(x,u)\,\Delta t) = V^*(x) + \nabla_x V^*(x) \cdot f(x,u)\,\Delta t + O(\Delta t^2)$$
Substitute Back
Combine steps 2 and 3 in the Bellman equation:
$$V^*(x) = \max_{u}\left[r(x,u)\,\Delta t + \big(1 - \rho\,\Delta t\big)\big(V^*(x) + \nabla_x V^*(x) \cdot f(x,u)\,\Delta t\big)\right] + O(\Delta t^2)$$
Expanding and collecting terms up to order $\Delta t$:
$$V^*(x) = \max_{u}\left[r(x,u)\,\Delta t + V^*(x) - \rho V^*(x)\,\Delta t + \nabla_x V^*(x) \cdot f(x,u)\,\Delta t\right] + O(\Delta t^2)$$
Cancel $V^*(x)$ and Divide by $\Delta t$
The $V^*(x)$ terms cancel on both sides:
$$0 = \max_{u}\left[r(x,u)\,\Delta t - \rho V^*(x)\,\Delta t + \nabla_x V^*(x) \cdot f(x,u)\,\Delta t\right] + O(\Delta t^2)$$
Divide through by $\Delta t$:
$$0 = \max_{u}\left[r(x,u) - \rho V^*(x) + \nabla_x V^*(x) \cdot f(x,u)\right] + O(\Delta t)$$
Take the Limit $\Delta t \to 0$
The $O(\Delta t)$ terms vanish; since $\rho V^*(x)$ does not depend on $u$, it moves outside the $\max$, yielding the HJB equation:
$$\rho V^*(x) = \max_{u}\left[r(x,u) + \nabla_x V^*(x) \cdot f(x,u)\right]$$
This is the Hamilton–Jacobi–Bellman equation for the deterministic, infinite-horizon, discounted control problem. ∎
For stochastic dynamics $dX = f(X,u)\,dt + \sigma(X)\,dW$, the Taylor expansion of $V^*(X_{t+\Delta t})$ must include the second-order Itô term. By Itô's lemma:
$$dV^*(X_t) = \left[\nabla_x V^* \cdot f(X_t, u_t) + \tfrac{1}{2}\text{tr}\big(\sigma\sigma^\top \nabla^2_x V^*\big)\right]dt + \nabla_x V^{*\top}\sigma(X_t)\,dW_t$$
The $O(dW^2) = O(dt)$ term from the quadratic variation does not vanish — it contributes at the same order as the drift. This is why the stochastic HJB gains the diffusion (Laplacian) term $\frac{1}{2}\text{tr}(\sigma\sigma^\top \nabla^2_x V^*)$, making HJB a second-order PDE rather than first-order.
Key Insight: The HJB equation is not just an approximation — it is the exact continuous-time optimality condition. Every $O(\Delta t^2)$ term vanished in the limit. The Bellman equation and HJB are the same object in different time representations.
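The limit argument can be sanity-checked numerically on a 1-D linear-quadratic problem (an assumed example, not from the text): with dynamics $\dot{x} = u$, reward $r(x,u) = -(x^2 + u^2)$, and rate $\rho$, the quadratic ansatz $V(x) = -a x^2$ solves the deterministic HJB when $a^2 + \rho a - 1 = 0$, with optimal control $u^* = -a x$. Plugging this exact solution into the discrete Bellman backup, the residual should shrink like $O(\Delta t^2)$, since the $O(\Delta t)$ terms cancel by the derivation above:

```python
import numpy as np

# Assumed 1-D LQ example: xdot = u, r(x,u) = -(x^2 + u^2), discount rate rho.
# HJB with ansatz V(x) = -a x^2 and maximizer u* = -a x gives a^2 + rho*a - 1 = 0.
rho = 1.0
a = (-rho + np.sqrt(rho**2 + 4.0)) / 2.0
V = lambda x: -a * x**2

def bellman_residual(x, dt):
    """Discrete Bellman backup of the exact HJB solution at the optimal control."""
    u = -a * x                                     # u* = (1/2) dV/dx
    backup = -(x**2 + u**2) * dt + np.exp(-rho * dt) * V(x + u * dt)
    return V(x) - backup

r1 = bellman_residual(1.0, 1e-2)
r2 = bellman_residual(1.0, 5e-3)
print(r1 / r2)  # ~4: halving dt quarters the residual, i.e. residual = O(dt^2)
```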
10. Discrete vs. Continuous — Comparison
| Aspect | Bellman (Discrete) | HJB (Continuous) |
|---|---|---|
| Time | Discrete: $t = 0, 1, 2, \ldots$ | Continuous: $t \in [0, \infty)$ |
| State space | Finite or countable $\mathcal{S}$ | Continuous $\mathcal{X} \subseteq \mathbb{R}^n$ |
| Optimality condition | System of equations (one per state) | PDE over $\mathcal{X}$ |
| Structure | Nonlinear fixed-point equation | Nonlinear PDE (first/second order) |
| Discount | $\gamma \in (0,1)$ | $\rho = -\ln\gamma / \Delta t > 0$ |
| Dynamics | Transition kernel $P(s'|s,a)$ | $dX = f(X,u)dt + \sigma dW$ |
| Gradient | No spatial gradient (states are discrete) | $\nabla_x V^*$ — gradient of value function |
| Stochastic term | Expectation over $P(\cdot|s,a)$ | $\frac{1}{2}\text{tr}(\sigma\sigma^\top \nabla^2_x V^*)$ (Itô) |
| Solution notion | Classical fixed point | Viscosity solution (if non-smooth) |
| Algorithms | Value iter., policy iter., Q-learning | Pontryagin MP, LQR, DDP, neural PDE solvers |
11. Summary Table
| Equation | Formula | Type |
|---|---|---|
| Bellman-V | $V^\pi(s) = \sum_a \pi(a|s)[r(s,a) + \gamma\sum_{s'}P(s'|s,a)V^\pi(s')]$ | Linear system (expectation) |
| Bellman-Q | $Q^\pi(s,a) = r(s,a) + \gamma\sum_{s'}P(s'|s,a)\sum_{a'}\pi(a'|s')Q^\pi(s',a')$ | Linear system (expectation) |
| Opt-V | $V^*(s) = \max_a[r(s,a) + \gamma\sum_{s'}P(s'|s,a)V^*(s')]$ | Nonlinear (optimality) |
| Opt-Q | $Q^*(s,a) = r(s,a) + \gamma\sum_{s'}P(s'|s,a)\max_{a'}Q^*(s',a')$ | Nonlinear (optimality) |
| HJB | $\rho V^*(x) = \sup_u[r(x,u) + \nabla_x V^* \cdot f(x,u) + \frac{1}{2}\text{tr}(\sigma\sigma^\top\nabla^2 V^*)]$ | Nonlinear PDE |
All five equations express the same optimality principle — the value of a state equals the immediate reward plus (discounted, optimized) future value. The differences are purely representational: discrete vs. continuous time, stochastic expectation vs. differential generator, finite sum vs. PDE.
12. Remarks & Memory Aids
Common Confusions
Is $V^*$ always differentiable? No. $V^*$ can fail to be differentiable at boundaries of the optimal switching surface (e.g., in bang-bang control problems). The theory of viscosity solutions handles this: $V^*$ is always a viscosity solution to HJB even when classical derivatives don't exist.
Why do we need a discount factor $\gamma < 1$? Mathematically, $\gamma < 1$ ensures the Bellman operator is a strict contraction, guaranteeing a unique fixed point. For $\gamma = 1$ (undiscounted), the problem may not have a well-defined finite value function without additional ergodicity conditions. The continuous-time analogue is $\rho > 0$.
How does HJB relate to Pontryagin's Maximum Principle (PMP)? PMP is an alternative set of necessary conditions for optimality in continuous-time control, formulated via Hamiltonian mechanics and costate variables. It is related to HJB: if $V^*$ is differentiable, the costate $\lambda(t) = \nabla_x V^*(x(t))$ satisfies the costate ODE from PMP. HJB is a global sufficient condition; PMP is local/necessary. For nonsmooth problems, PMP generalizes more cleanly.
13. Historical Timeline
References: Bellman (1957) Dynamic Programming · Sutton & Barto (2018) Reinforcement Learning · Fleming & Rishel (1975) Deterministic and Stochastic Optimal Control · Crandall, Ishii & Lions (1992) User's guide to viscosity solutions · Bertsekas (2012) Dynamic Programming and Optimal Control