A rigorous & intuitive treatment — from discrete-time dynamic programming to the continuous-time HJB equation — for the graduate RL researcher.
1. Motivation & Intuition
Imagine a chess player evaluating board positions. They don't compute the outcome of every possible game to the end — instead, they develop an intuition about how "good" each position is. This intuition is exactly the value function: a mapping from states (and optionally actions) to expected cumulative reward, which encodes how desirable a situation is under a given policy.
The central idea of dynamic programming is that optimal sequential decisions can be decomposed into a current decision plus the optimal value of the future. This is Bellman's principle of optimality.
The Bellman equation makes this recursive: the value of being in a state equals the immediate reward you collect plus (discounted) value of the next state. This recursion is the engine of almost all model-based RL algorithms.
The Hamilton–Jacobi–Bellman equation (the continuous-time PDE analogue of the Bellman optimality equation, characterizing the optimal value function as the solution of a partial differential equation involving the system dynamics) is what Bellman's equation becomes when time is continuous and the state evolves according to a differential equation — the natural setting for control theory and physics-based problems.
We operate in a Markov Decision Process (MDP): a tuple $(\mathcal{S}, \mathcal{A}, P, r, \gamma)$ where $\mathcal{S}$ is the state space, $\mathcal{A}$ the action space, $P(s'|s,a)$ the transition kernel, $r(s,a)$ the reward function, and $\gamma \in [0,1)$ the discount factor.
An agent follows a policy $\pi: \mathcal{S} \to \Delta(\mathcal{A})$ (stationary, possibly stochastic). The goal is to find $\pi^*$ that maximizes the expected discounted return:
$$\pi^* = \arg\max_\pi \; \mathbb{E}_\pi\!\left[\sum_{t=0}^{\infty} \gamma^t\, r(s_t, a_t)\right]$$
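As a concrete sketch of the quantity being maximized, the discounted return is just a geometrically weighted sum of rewards along a trajectory (the reward numbers below are illustrative, not from the text):

```python
# Discounted return G = sum_t gamma^t * r_t for one illustrative trajectory.
gamma = 0.9
rewards = [1.0, 0.0, 2.0]  # r_0, r_1, r_2 (assumed numbers)
G = sum(gamma**t * r for t, r in enumerate(rewards))
print(G)  # 1 + 0 + 2 * 0.81 = 2.62
```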
The key insight is that this infinite-horizon problem has recursive structure — the basis for Bellman's work and subsequently the entire edifice of DP and RL.
2. Value Functions
2.1 State-Value Function $V^\pi(s)$
$V^\pi(s)$ answers: "If I'm in state $s$ and follow policy $\pi$ forever, how much total reward can I expect to accumulate?" Formally,
$$V^\pi(s) = \mathbb{E}_\pi\!\left[\sum_{t=0}^{\infty} \gamma^t\, r(s_t, a_t) \;\middle|\; s_0 = s\right]$$
Think of it as the "price" of being in a state under a given behavioral strategy. Higher $V^\pi(s)$ = better state.
The expectation is over the stochasticity in both the policy $\pi(\cdot|s)$ and the transition kernel $P(\cdot|s,a)$.
2.2 Action-Value (Q) Function $Q^\pi(s,a)$
$Q^\pi(s,a)$ answers: "If I'm in state $s$, take action $a$ right now (even if $\pi$ wouldn't normally suggest it), then follow $\pi$ forever after — how much reward do I get?" Formally,
$$Q^\pi(s,a) = \mathbb{E}_\pi\!\left[\sum_{t=0}^{\infty} \gamma^t\, r(s_t, a_t) \;\middle|\; s_0 = s,\; a_0 = a\right]$$
The Q-function is more informative than V — it tells you the value of each action, enabling direct policy improvement.
The two functions are related by:
$$V^\pi(s) = \sum_{a} \pi(a|s) \, Q^\pi(s,a) = \mathbb{E}_{a \sim \pi(\cdot|s)}[Q^\pi(s,a)]$$
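The identity $V^\pi(s) = \mathbb{E}_{a\sim\pi(\cdot|s)}[Q^\pi(s,a)]$ can be checked on a toy example (all numbers are illustrative, not from the text):

```python
import numpy as np

# Illustrative 2-state, 2-action example.
Q = np.array([[1.0, 3.0],    # Q(s0, a0), Q(s0, a1)
              [2.0, 0.0]])   # Q(s1, a0), Q(s1, a1)
pi = np.array([[0.5, 0.5],   # pi(. | s0)
               [0.9, 0.1]])  # pi(. | s1)

# V^pi(s) = sum_a pi(a|s) Q^pi(s, a): a policy-weighted average per state.
V = (pi * Q).sum(axis=1)
print(V)  # [2.  1.8]
```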
3. Bellman Expectation Equations
The Bellman equations give recursive characterizations of value functions. They are the central structural property enabling computation of $V^\pi$ and $Q^\pi$.
3.1 Bellman Equation for $V^\pi$
The value of a state is the immediate reward you expect to get (by averaging over actions and next states) plus the discounted value of wherever you end up next. The value decomposes into now and everything after now.
Starting from the definition and splitting off the first reward term:
$$V^\pi(s) = \sum_a \pi(a|s)\left[r(s,a) + \gamma \sum_{s'} P(s'|s,a)\, V^\pi(s')\right]$$
Equivalently, in operator form: $V^\pi = \mathcal{T}^\pi V^\pi$, where the Bellman operator $\mathcal{T}^\pi$ maps any function $f: \mathcal{S} \to \mathbb{R}$ to $(\mathcal{T}^\pi f)(s) = \sum_a \pi(a|s)[r(s,a) + \gamma \sum_{s'} P(s'|s,a)f(s')]$. $\mathcal{T}^\pi$ is a $\gamma$-contraction in $\ell^\infty$, so repeated application converges to its unique fixed point $V^\pi$.
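The contraction property can be observed numerically. The sketch below builds a randomly generated MDP (illustrative, not from the text), computes $V^\pi$ exactly from the linear system, and checks that each application of $\mathcal{T}^\pi$ shrinks the sup-norm error by at least a factor of $\gamma$:

```python
import numpy as np

rng = np.random.default_rng(0)
nS, nA, gamma = 4, 2, 0.9

# Random MDP (illustrative): transition kernel P[s, a, s'] and reward r[s, a].
P = rng.random((nS, nA, nS))
P /= P.sum(axis=2, keepdims=True)
r = rng.random((nS, nA))
pi = np.full((nS, nA), 1.0 / nA)   # uniform random policy

def T_pi(V):
    """(T^pi V)(s) = sum_a pi(a|s) [r(s,a) + gamma * sum_s' P(s'|s,a) V(s')]"""
    return (pi * (r + gamma * (P @ V))).sum(axis=1)

# Exact V^pi from the linear system (I - gamma P^pi) V^pi = r^pi.
P_pi = np.einsum('sa,sat->st', pi, P)   # P^pi[s, s'] = sum_a pi(a|s) P(s'|s,a)
r_pi = (pi * r).sum(axis=1)
V_exact = np.linalg.solve(np.eye(nS) - gamma * P_pi, r_pi)

V = np.zeros(nS)
for _ in range(20):
    err_before = np.abs(V - V_exact).max()
    V = T_pi(V)
    err_after = np.abs(V - V_exact).max()
    assert err_after <= gamma * err_before + 1e-12   # gamma-contraction in sup norm
```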
3.2 Bellman Equation for $Q^\pi$
The Q-function's Bellman equation shows: take action $a$, observe reward $r(s,a)$, land in $s'$, then follow $\pi$ (averaging over $\pi$'s choice of $a'$ in $s'$):
$$Q^\pi(s,a) = r(s,a) + \gamma \sum_{s'} P(s'|s,a) \sum_{a'} \pi(a'|s')\, Q^\pi(s',a')$$
$V^\pi$ and $Q^\pi$ are interchangeable: $V^\pi(s) = \mathbb{E}_{a\sim\pi}[Q^\pi(s,a)]$ and $Q^\pi(s,a) = r(s,a) + \gamma \mathbb{E}_{s'}[V^\pi(s')]$. This symmetry drives policy gradient theorems.
4. Bellman Optimality Equations
What if we don't commit to any fixed policy, but instead ask: what is the best possible value achievable from each state? This is the optimal value function $V^*(s) = \max_\pi V^\pi(s)$, the envelope over all policies; similarly, $Q^*(s,a) = \max_\pi Q^\pi(s,a)$. $V^*$ is the ceiling: no policy can do better.
The Bellman optimality equation says: at each state, you get to pick the best action, then nature picks the next state.
Define $V^*(s) = \max_\pi V^\pi(s)$ and $Q^*(s,a) = \max_\pi Q^\pi(s,a)$. By Bellman's principle of optimality:
$$V^*(s) = \max_a\left[r(s,a) + \gamma \sum_{s'} P(s'|s,a)\, V^*(s')\right], \qquad Q^*(s,a) = r(s,a) + \gamma \sum_{s'} P(s'|s,a)\, \max_{a'} Q^*(s',a')$$
Unlike the Bellman expectation equations (which are linear in $V^\pi$), the optimality equations are nonlinear due to the $\max$ operator. This nonlinearity makes them harder to solve analytically, but the optimality operator is still a contraction, so $V^*$ remains its unique fixed point.
The optimal policy is recovered greedily: $\pi^*(s) = \arg\max_a Q^*(s,a)$. The Bellman optimality operator $\mathcal{T}^*$ defined by $(\mathcal{T}^* V)(s) = \max_a[r(s,a) + \gamma \sum_{s'} P(s'|s,a)V(s')]$ is also a $\gamma$-contraction in $\ell^\infty(\mathcal{S})$.
5. The Bellman Backup
The Bellman backup propagates value from successor states back to the current state: each application of the recursive equation pulls one more step of future reward into the current estimate.
6. Bellman → Dynamic Programming
The Bellman equations are not just structural — they compute. The fixed-point interpretation gives us two canonical algorithms.
6.1 Value Iteration
Start with any value function (e.g., all zeros). Repeatedly apply the Bellman optimality operator. Since $\mathcal{T}^*$ is a contraction, the sequence converges to $V^*$.
Repeat until convergence ($\|V_{k+1} - V_k\|_\infty < \epsilon$):
$$V_{k+1}(s) = \max_a\left[r(s,a) + \gamma \sum_{s'} P(s'|s,a)\, V_k(s')\right]$$
Convergence rate: after $k$ iterations, $\|V_k - V^*\|_\infty \le \gamma^k \|V_0 - V^*\|_\infty$. This is geometric.
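A minimal sketch of value iteration, applied to a toy 3-state chain MDP (the MDP is an assumed example, not from the text):

```python
import numpy as np

def value_iteration(P, r, gamma, tol=1e-8):
    """Repeatedly apply the Bellman optimality operator T*.
    P has shape (S, A, S); r has shape (S, A)."""
    V = np.zeros(P.shape[0])
    while True:
        Q = r + gamma * (P @ V)        # one-step lookahead Q-values
        V_new = Q.max(axis=1)          # (T* V)(s) = max_a Q(s, a)
        if np.abs(V_new - V).max() < tol:
            return V_new, Q.argmax(axis=1)  # value and greedy policy
        V = V_new

# Toy 3-state chain (assumed example): action 0 = "left", action 1 = "right";
# reward 1 for stepping from state 1 into the absorbing state 2.
P = np.zeros((3, 2, 3))
P[0, 0, 0] = P[0, 1, 1] = 1.0
P[1, 0, 0] = P[1, 1, 2] = 1.0
P[2, 0, 2] = P[2, 1, 2] = 1.0
r = np.zeros((3, 2))
r[1, 1] = 1.0
V, greedy = value_iteration(P, r, gamma=0.9)
print(V)       # approx [0.9, 1.0, 0.0]
print(greedy)  # [1 1 0] -- always move right toward the reward
```

Note that the greedy policy is recovered for free from the final lookahead, exactly as in $\pi^*(s) = \arg\max_a Q^*(s,a)$.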
6.2 Policy Iteration
Start with any policy $\pi_0$. Alternate between: (1) evaluating the current policy exactly (solve the Bellman expectation equation), and (2) improving by acting greedily w.r.t. the computed values. Each step is guaranteed not to decrease performance, and the process terminates at $\pi^*$ in finite steps (for finite MDPs).
Policy Evaluation: Solve $(I - \gamma P^{\pi_k}) V^{\pi_k} = r^{\pi_k}$ — a linear system.
Policy Improvement: $\pi_{k+1}(s) = \arg\max_a [r(s,a) + \gamma \sum_{s'} P(s'|s,a) V^{\pi_k}(s')]$.
Policy improvement theorem: $V^{\pi_{k+1}} \ge V^{\pi_k}$ pointwise. Since the policy space is finite, termination is guaranteed.
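The evaluate-then-improve loop can be sketched in a few lines; evaluation is the exact linear solve described above, and the toy chain MDP used as a usage check is an assumed example:

```python
import numpy as np

def policy_iteration(P, r, gamma):
    """Alternate exact policy evaluation (linear solve) with greedy improvement."""
    nS = P.shape[0]
    pi = np.zeros(nS, dtype=int)                  # arbitrary initial deterministic policy
    while True:
        P_pi = P[np.arange(nS), pi]               # (S, S) transition kernel under pi
        r_pi = r[np.arange(nS), pi]
        V = np.linalg.solve(np.eye(nS) - gamma * P_pi, r_pi)   # evaluation
        pi_new = (r + gamma * (P @ V)).argmax(axis=1)          # greedy improvement
        if np.array_equal(pi_new, pi):
            return V, pi                          # greedy w.r.t. own value: optimal
        pi = pi_new

# Toy 3-state chain (assumed example): reward 1 for stepping from state 1
# into the absorbing state 2.
P = np.zeros((3, 2, 3))
P[0, 0, 0] = P[0, 1, 1] = 1.0
P[1, 0, 0] = P[1, 1, 2] = 1.0
P[2, 0, 2] = P[2, 1, 2] = 1.0
r = np.zeros((3, 2))
r[1, 1] = 1.0
V, pi = policy_iteration(P, r, gamma=0.9)
print(V, pi)  # approx [0.9, 1.0, 0.0], policy [1 1 0]
```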
Both algorithms find fixed points of Bellman operators:
- $V^\pi$ is the unique fixed point of $\mathcal{T}^\pi$ (Bellman expectation operator)
- $V^*$ is the unique fixed point of $\mathcal{T}^*$ (Bellman optimality operator)
- Contraction mapping theorem (Banach) guarantees existence, uniqueness, and convergence
6.3 Generalized Policy Iteration
Modern RL algorithms (Actor-Critic, PPO, SAC) can be viewed as approximate generalized policy iteration: they interleave partial evaluation (TD learning, Monte Carlo) with partial improvement (gradient-based policy updates). The Bellman equation remains the core recursive structure throughout.
7. Why Continuous-Time Control?
The discrete MDP framework is elegant but has limitations in physical systems. Real robots, aircraft, financial markets, and biological systems evolve continuously — their state changes according to differential equations, not jump processes.
- State dynamics are naturally modeled as SDEs: $dX_t = f(X_t, u_t)dt + \sigma(X_t)dW_t$
- Discrete-time with small $\Delta t$ only approximates this; errors accumulate
- Continuous-time gives exact, closed-form optimality conditions
- PDE theory (viscosity solutions) handles non-smooth value functions rigorously
In continuous time, the state evolves as a controlled diffusion:
$$dX_t = f(X_t, u_t)\,dt + \sigma(X_t)\,dW_t$$
where $W_t$ is a standard Brownian motion and $u_t \in \mathcal{U}$ is the control (the analogue of an action). The objective is:
$$\max_{u} \; \mathbb{E}\!\left[\int_0^\infty e^{-\rho t}\, r(X_t, u_t)\, dt\right]$$
where $\rho > 0$ is the continuous-time discount rate, related to the discrete factor by $\gamma = e^{-\rho \Delta t}$ (so $\rho$ plays the role of $-\ln\gamma$ per unit time).
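Under the dynamics and objective above, a single rollout can be approximated by Euler–Maruyama discretization. The sketch below estimates the (truncated) discounted objective; the 1-D linear-quadratic example and the feedback policy $u = -x$ are assumptions for illustration, not from the text:

```python
import numpy as np

def simulate_return(x0, policy, f, sigma, r, rho, dt=1e-3, T=10.0, seed=0):
    """Euler--Maruyama rollout of dX = f(X,u) dt + sigma(X) dW, accumulating
    the discounted objective int_0^T e^{-rho t} r(X_t, u_t) dt (truncated at T)."""
    rng = np.random.default_rng(seed)
    x, t, G = x0, 0.0, 0.0
    while t < T:
        u = policy(x)
        G += np.exp(-rho * t) * r(x, u) * dt
        x = x + f(x, u) * dt + sigma(x) * np.sqrt(dt) * rng.standard_normal()
        t += dt
    return G

# Assumed 1-D example: drift f = u, quadratic cost, zero noise, and the
# linear feedback policy u = -x. Then X_t = x0 * e^{-t}, so the objective
# integral has the closed form -2/3 * (1 - e^{-3T}) for comparison.
G = simulate_return(x0=1.0, policy=lambda x: -x,
                    f=lambda x, u: u, sigma=lambda x: 0.0,
                    r=lambda x, u: -(x**2 + u**2), rho=1.0)
print(G)  # close to -2/3
```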
8. The Hamilton–Jacobi–Bellman Equation
The HJB equation is the PDE that the optimal value function $V^*$ must satisfy in continuous time. It says: at any state $x$, there is a "flow" that keeps $V^*$ consistent — the instantaneous reward plus the expected drift of value due to system dynamics must exactly balance the discount rate times $V^*$.
It is the continuous-time analogue of the Bellman optimality equation. While the Bellman equation is a system of equations (one per state), HJB is a single PDE governing the value function over the entire state space.
For the deterministic case ($\sigma = 0$), HJB is:
$$\rho V^*(x) = \max_{u \in \mathcal{U}}\left[r(x,u) + \nabla_x V^*(x) \cdot f(x,u)\right]$$
For the stochastic case (controlled diffusion), the infinitesimal generator $\mathcal{L}^u$ acts on $V$ as $(\mathcal{L}^u V)(x) = f(x,u) \cdot \nabla_x V(x) + \frac{1}{2}\text{tr}\big(\sigma(x)\sigma(x)^\top \nabla^2_x V(x)\big)$, and HJB becomes:
$$\rho V^*(x) = \sup_{u \in \mathcal{U}}\left[r(x,u) + (\mathcal{L}^u V^*)(x)\right]$$
Viscosity Solutions. $V^*$ need not be smooth (it may not be differentiable everywhere). The notion of viscosity solutions (Crandall–Lions) provides the correct framework for HJB when classical derivatives fail — the unique viscosity solution is $V^*$.
9. Derivation: Bellman → HJB
This is the core of the connection. We derive the HJB equation as the continuous-time limit $\Delta t \to 0$ of the discrete-time Bellman optimality equation. We work with the deterministic case for clarity; the stochastic case adds an Itô correction term.
- Deterministic dynamics: $\dot{x}(t) = f(x(t), u(t))$
- $V^*$ is sufficiently smooth (at least $C^1$ in $x$ and $t$, $C^2$ for stochastic)
- Discounting: the per-step factor is $\gamma_{\Delta t} = e^{-\rho \Delta t}$, so that the discrete discount matches the continuous-time rate $\rho$
- Reward accumulates as $\int_0^{\Delta t} e^{-\rho s} r(x(s), u(s)) ds \approx r(x,u)\Delta t + O(\Delta t^2)$
Start: Discrete-Time Bellman Optimality
With time step $\Delta t$, the Bellman optimality equation reads:
$$V^*(x) = \max_{u}\left[r(x,u)\,\Delta t + e^{-\rho \Delta t}\, V^*\big(x + f(x,u)\,\Delta t + O(\Delta t^2)\big)\right]$$
where we use $\gamma = e^{-\rho \Delta t}$ (so $\gamma \to 1$ as $\Delta t \to 0$, with $\rho$ fixed) and the next state is $x' = x + f(x,u)\Delta t + O(\Delta t^2)$ from Euler discretization.
Taylor Expand the Discount Factor
Expand $e^{-\rho \Delta t}$ for small $\Delta t$:
$$e^{-\rho \Delta t} = 1 - \rho\,\Delta t + O(\Delta t^2)$$
Taylor Expand $V^*$ at Next State
Expand $V^*(x + f(x,u)\Delta t)$ around $x$:
$$V^*(x + f(x,u)\,\Delta t) = V^*(x) + \nabla_x V^*(x) \cdot f(x,u)\,\Delta t + O(\Delta t^2)$$
Substitute Back
Combine steps 2 and 3 in the Bellman equation:
$$V^*(x) = \max_{u}\left[r(x,u)\,\Delta t + \big(1 - \rho\,\Delta t\big)\big(V^*(x) + \nabla_x V^*(x) \cdot f(x,u)\,\Delta t\big)\right] + O(\Delta t^2)$$
Expanding and collecting terms up to order $\Delta t$:
$$V^*(x) = \max_{u}\left[r(x,u)\,\Delta t + V^*(x) - \rho V^*(x)\,\Delta t + \nabla_x V^*(x) \cdot f(x,u)\,\Delta t\right] + O(\Delta t^2)$$
Cancel $V^*(x)$ and Divide by $\Delta t$
The $V^*(x)$ terms cancel on both sides:
$$0 = \max_{u}\left[r(x,u)\,\Delta t - \rho V^*(x)\,\Delta t + \nabla_x V^*(x) \cdot f(x,u)\,\Delta t\right] + O(\Delta t^2)$$
Divide through by $\Delta t$:
$$0 = \max_{u}\left[r(x,u) - \rho V^*(x) + \nabla_x V^*(x) \cdot f(x,u)\right] + O(\Delta t)$$
Take the Limit $\Delta t \to 0$
The $O(\Delta t)$ terms vanish; since $\rho V^*(x)$ does not depend on $u$, it moves outside the $\max$, yielding the HJB equation:
$$\rho V^*(x) = \max_{u}\left[r(x,u) + \nabla_x V^*(x) \cdot f(x,u)\right]$$
This is the Hamilton–Jacobi–Bellman equation for the deterministic, infinite-horizon, discounted control problem. ∎
For stochastic dynamics $dX = f(X,u)\,dt + \sigma(X)\,dW$, the Taylor expansion of $V^*(X_{t+\Delta t})$ must include the second-order Itô term. By Itô's lemma:
$$dV^*(X_t) = \left[\nabla_x V^* \cdot f(X_t, u_t) + \tfrac{1}{2}\text{tr}\big(\sigma\sigma^\top \nabla^2_x V^*\big)\right]dt + \nabla_x V^{*\top}\sigma(X_t)\,dW_t$$
The $O(dW^2) = O(dt)$ term from the quadratic variation does not vanish — it contributes at the same order as the drift. This is why the stochastic HJB gains the diffusion (Laplacian) term $\frac{1}{2}\text{tr}(\sigma\sigma^\top \nabla^2_x V^*)$, making HJB a second-order PDE rather than first-order.
Key Insight: The HJB equation is not just an approximation — it is the exact continuous-time optimality condition. Every $O(\Delta t^2)$ term vanished in the limit. The Bellman equation and HJB are the same object in different time representations.
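The limit argument can be sanity-checked numerically on a 1-D linear-quadratic problem (an assumed example, not from the text): with dynamics $\dot{x} = u$, reward $r(x,u) = -(x^2 + u^2)$, and rate $\rho$, the quadratic ansatz $V(x) = -a x^2$ solves the deterministic HJB when $a^2 + \rho a - 1 = 0$, with optimal control $u^* = -a x$. Plugging this exact solution into the discrete Bellman backup, the residual should shrink like $O(\Delta t^2)$, since the $O(\Delta t)$ terms cancel by the derivation above:

```python
import numpy as np

# Assumed 1-D LQ example: xdot = u, r(x,u) = -(x^2 + u^2), discount rate rho.
# HJB with ansatz V(x) = -a x^2 and maximizer u* = -a x gives a^2 + rho*a - 1 = 0.
rho = 1.0
a = (-rho + np.sqrt(rho**2 + 4.0)) / 2.0
V = lambda x: -a * x**2

def bellman_residual(x, dt):
    """Discrete Bellman backup of the exact HJB solution at the optimal control."""
    u = -a * x                                     # u* = (1/2) dV/dx
    backup = -(x**2 + u**2) * dt + np.exp(-rho * dt) * V(x + u * dt)
    return V(x) - backup

r1 = bellman_residual(1.0, 1e-2)
r2 = bellman_residual(1.0, 5e-3)
print(r1 / r2)  # ~4: halving dt quarters the residual, i.e. residual = O(dt^2)
```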
10. Discrete vs. Continuous — Comparison
| Aspect | Bellman (Discrete) | HJB (Continuous) |
|---|---|---|
| Time | Discrete: $t = 0, 1, 2, \ldots$ | Continuous: $t \in [0, \infty)$ |
| State space | Finite or countable $\mathcal{S}$ | Continuous $\mathcal{X} \subseteq \mathbb{R}^n$ |
| Optimality condition | System of equations (one per state) | PDE over $\mathcal{X}$ |
| Structure | Nonlinear fixed-point equation | Nonlinear PDE (first/second order) |
| Discount | $\gamma \in (0,1)$ | $\rho = -\ln\gamma / \Delta t > 0$ |
| Dynamics | Transition kernel $P(s'|s,a)$ | $dX = f(X,u)dt + \sigma dW$ |
| Gradient | No spatial gradient (states are discrete) | $\nabla_x V^*$ — gradient of value function |
| Stochastic term | Expectation over $P(\cdot|s,a)$ | $\frac{1}{2}\text{tr}(\sigma\sigma^\top \nabla^2_x V^*)$ (Itô) |
| Solution notion | Classical fixed point | Viscosity solution (if non-smooth) |
| Algorithms | Value iter., policy iter., Q-learning | Pontryagin MP, LQR, DDP, neural PDE solvers |
11. Summary Table
| Equation | Formula | Type |
|---|---|---|
| Bellman-V | $V^\pi(s) = \sum_a \pi(a|s)[r(s,a) + \gamma\sum_{s'}P(s'|s,a)V^\pi(s')]$ | Linear system (expectation) |
| Bellman-Q | $Q^\pi(s,a) = r(s,a) + \gamma\sum_{s'}P(s'|s,a)\sum_{a'}\pi(a'|s')Q^\pi(s',a')$ | Linear system (expectation) |
| Opt-V | $V^*(s) = \max_a[r(s,a) + \gamma\sum_{s'}P(s'|s,a)V^*(s')]$ | Nonlinear (optimality) |
| Opt-Q | $Q^*(s,a) = r(s,a) + \gamma\sum_{s'}P(s'|s,a)\max_{a'}Q^*(s',a')$ | Nonlinear (optimality) |
| HJB | $\rho V^*(x) = \sup_u[r(x,u) + \nabla_x V^* \cdot f(x,u) + \frac{1}{2}\text{tr}(\sigma\sigma^\top\nabla^2 V^*)]$ | Nonlinear PDE |
All five equations express the same optimality principle — the value of a state equals the immediate reward plus (discounted, optimized) future value. The differences are purely representational: discrete vs. continuous time, stochastic expectation vs. differential generator, finite sum vs. PDE.
12. Remarks & Memory Aids
Common Confusions
Is $V^*$ always differentiable? No. $V^*$ can fail to be differentiable at boundaries of the optimal switching surface (e.g., in bang-bang control problems). The theory of viscosity solutions handles this: $V^*$ is always a viscosity solution to HJB even when classical derivatives don't exist.
Why do we need a discount factor $\gamma < 1$? Mathematically, $\gamma < 1$ ensures the Bellman operator is a strict contraction, guaranteeing a unique fixed point. For $\gamma = 1$ (undiscounted), the problem may not have a well-defined finite value function without additional ergodicity conditions. The continuous-time analogue is $\rho > 0$.
How does HJB relate to Pontryagin's Maximum Principle (PMP)? PMP is an alternative set of necessary conditions for optimality in continuous-time control, formulated via Hamiltonian mechanics and costate variables. It is related to HJB: if $V^*$ is differentiable, the costate $\lambda(t) = \nabla_x V^*(x(t))$ satisfies the costate ODE from PMP. HJB is a global sufficient condition; PMP is local/necessary. For nonsmooth problems, PMP generalizes more cleanly.
13. Historical Timeline
References: Bellman (1957) Dynamic Programming · Sutton & Barto (2018) Reinforcement Learning · Fleming & Rishel (1975) Deterministic and Stochastic Optimal Control · Crandall, Ishii & Lions (1992) User's guide to viscosity solutions · Bertsekas (2012) Dynamic Programming and Optimal Control