Reinforcement Learning — Deep Dive

Function Approximation
in Reinforcement Learning

Research-level · Interactive · PhD / MTech · ~45 min read
A rigorous, interactive treatment of function approximation in RL — from the curse of dimensionality through deep RL and offline learning. Every concept is developed from intuition to formal mathematics, with interactive visualizations throughout.

§1 Motivation & Intuition

The Curse of Dimensionality

Classical tabular RL — Q-tables, V-tables — represents the value function as a lookup table with one entry per state (or state-action pair). This is elegant and comes with convergence guarantees, but it collapses catastrophically as state spaces grow.

Imagine a robot navigating a room. If you discretize position into a 100×100 grid, that's 10,000 states — manageable. Add orientation (360 discrete headings), joint angles for each of 7 DOF, and object states, and the number of states quickly runs into the trillions. You could never visit them all, let alone store a Q-value for each.

The curse of dimensionality says that the volume of a space grows exponentially with its dimension. If each feature has \(k\) values, then \(d\) features require \(k^d\) table entries. For \(k=10, d=20\), that's \(10^{20}\) — more than the number of seconds that have elapsed since the Big Bang.

Suppose the state space is a grid over \(d\) continuous variables, each discretized into \(k\) bins. The table size is:

$$|\mathcal{S}| = k^d \quad \Rightarrow \quad \text{memory} = \mathcal{O}(k^d)$$

For \(k=100, d=10\): \(10^{20}\) entries. Even at 1 byte per entry, storage exceeds \(10^8\) terabytes. Moreover, sample complexity scales proportionally — we need \(\Omega(|\mathcal{S}|)\) samples simply to visit every state.
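A quick back-of-the-envelope computation makes the blow-up concrete. The sketch below (plain Python; the \((k, d)\) pairs are just the examples from the text) prints table sizes and memory at one byte per entry:

```python
# Back-of-the-envelope: tabular storage as a function of bins k and dimensions d.
def table_entries(k: int, d: int) -> int:
    """Number of entries in a lookup table over d features with k bins each."""
    return k ** d

for k, d in [(10, 3), (10, 20), (100, 10)]:
    entries = table_entries(k, d)
    terabytes = entries / 1e12          # at 1 byte per entry
    print(f"k={k:>3}, d={d:>2}: {entries:.2e} entries ≈ {terabytes:.2e} TB")
```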

Interactive: Curse of Dimensionality (table entries grow exponentially with state dimension; slider for bins per dimension k).

Why Tabular RL Fails

Three fundamental failures of tabular RL in real-world problems:

💾
Storage
A lookup table over continuous or very high-dimensional state-action spaces requires infeasibly many entries — it cannot even be stored.
📊
Sample Efficiency
Each state must be visited many times to estimate its value accurately. Most states are never visited.
🔁
Generalization
Tabular methods cannot generalize across similar states. Knowing V(s₁) tells you nothing about V(s₂), even if s₁ ≈ s₂.

The Need for Generalization

The key insight is that nearby states should have similar values. A robot 1mm to the left vs 1mm to the right of a door threshold has nearly the same long-term reward potential. Function approximation exploits this smoothness by representing \(V\) or \(Q\) as a parametric function — learning generalizes across the entire state space from a subset of samples.

🎯 Core Idea
Instead of storing \(V(s)\) for every \(s\), represent it as \(V(s;\theta)\) for a compact parameter vector \(\theta \in \mathbb{R}^d\) with \(d \ll |\mathcal{S}|\). Updating \(\theta\) on one state automatically updates predictions for all similar states.

§2 Formal Setup

The MDP Framework

We work within a Markov Decision Process \(\mathcal{M} = (\mathcal{S}, \mathcal{A}, P, R, \gamma)\).

\(\mathcal{S}\)
State Space
Can be discrete, continuous, or mixed. Often \(\mathcal{S} \subseteq \mathbb{R}^n\).
\(\mathcal{A}\)
Action Space
Discrete or continuous. Continuous actions require special treatment.
\(P(s'|s,a)\)
Transition Dynamics
Probability distribution over next states. Often unknown; must be learned or bypassed.
\(R(s,a)\)
Reward Function
Expected immediate reward. May also be stochastic: \(R(s,a,s')\).
\(\gamma\)
Discount Factor
\(\gamma \in [0,1)\). Makes future rewards geometrically less important. Ensures convergence of value sums.

The Objects We Approximate

Function approximation can target three fundamental objects:

State Value Function V(s; θ): How good is it to be in state \(s\)? This estimates the expected total discounted reward from state \(s\) under a given policy. Useful for policy evaluation and improvement.

Action-Value Function Q(s,a; θ): How good is it to take action \(a\) in state \(s\)? This allows greedy policy extraction without knowing the model.

Policy π(a|s; θ): Directly represent the policy as a parametric distribution over actions. Allows for smooth, stochastic policies which can be optimized via gradient ascent.

State-Value Function
$$V^\pi(s) = \mathbb{E}_\pi\!\left[\sum_{t=0}^\infty \gamma^t R(s_t, a_t) \;\Big|\; s_0 = s\right]$$
Action-Value Function
$$Q^\pi(s,a) = \mathbb{E}_\pi\!\left[\sum_{t=0}^\infty \gamma^t R(s_t, a_t) \;\Big|\; s_0 = s, a_0 = a\right]$$
Parameterized Policy
$$\pi_\theta(a|s) = \frac{\exp(f_\theta(s,a))}{\sum_{a'}\exp(f_\theta(s,a'))} \quad \text{(softmax policy)}$$

§3 Function Classes

Linear Function Approximation

The most theoretically tractable class. The approximated value function is linear in the feature representation of states:

$$\hat{V}(s;\theta) = \theta^\top \phi(s) = \sum_{j=1}^d \theta_j \phi_j(s)$$

where \(\phi: \mathcal{S} \to \mathbb{R}^d\) is a feature map and \(\theta \in \mathbb{R}^d\) are learnable weights. The key property: the function is linear in \(\theta\), not necessarily in \(s\) — the features \(\phi(s)\) can be highly nonlinear.

✓ Advantages of Linear FA

Convergence guarantees with TD-learning (under on-policy). Unique global minimum (convex optimization). Efficient computation — only \(d\) parameters. Clear geometric interpretation as projection onto the linear function space.

⚠ Limitations

Representational capacity is fundamentally limited — the true value function may not be linearly realizable. Performance is heavily dependent on feature engineering. Cannot model interaction effects without explicitly including them.

Basis Functions

Polynomial Basis

For \(s \in \mathbb{R}^n\), polynomial features: \(\phi(s) = [1, s_1, s_2, \ldots, s_1^2, s_1 s_2, \ldots]\). Degree-\(k\) polynomials give \(\binom{n+k}{k}\) features, a count that grows combinatorially with the dimension \(n\) and the degree \(k\).

Fourier Basis

Periodic features that capture oscillatory structure:

$$\phi_i(s) = \cos\!\left(\pi \mathbf{c}_i^\top s\right), \quad \mathbf{c}_i \in \mathbb{Z}^n$$

Radial Basis Functions (RBF)

Gaussian bumps centered at prototype states \(\{s_1, \ldots, s_d\}\):

$$\phi_i(s) = \exp\!\left(-\frac{\|s - s_i\|^2}{2\sigma^2}\right)$$
Interactive: Basis Functions (approximating a target function with different basis sets; slider for number of basis functions d).
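To make the linear-in-\(\theta\) structure concrete, here is a minimal NumPy sketch that builds Gaussian RBF features for a one-dimensional state and evaluates \(\hat{V}(s;\theta) = \theta^\top \phi(s)\). The prototype centers, bandwidth, and zero-initialized weights are illustrative choices, not values from the text:

```python
import numpy as np

def rbf_features(s: float, centers: np.ndarray, sigma: float) -> np.ndarray:
    """Gaussian bumps: phi_i(s) = exp(-||s - s_i||^2 / (2 sigma^2))."""
    return np.exp(-(s - centers) ** 2 / (2 * sigma ** 2))

centers = np.linspace(0.0, 1.0, 8)     # 8 prototype states (illustrative)
sigma = 0.15
theta = np.zeros_like(centers)         # learnable weights

def v_hat(s: float) -> float:
    """Linear value estimate: linear in theta, nonlinear in s."""
    return float(theta @ rbf_features(s, centers, sigma))

print(v_hat(0.3))                      # 0.0 before any learning
```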

Neural Networks

Neural networks are universal function approximators — given sufficient width/depth, they can approximate any continuous function on a compact domain to arbitrary precision (Universal Approximation Theorem).

$$\hat{V}(s;\theta) = f_L \circ f_{L-1} \circ \cdots \circ f_1(s)$$ $$f_\ell(x) = \sigma(W_\ell x + b_\ell), \quad \sigma = \text{ReLU, tanh, ...}$$
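As a tiny concrete instance of this composition, the sketch below implements the forward pass of a two-hidden-layer ReLU value network in NumPy; the layer sizes and random initialization are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
sizes = [4, 64, 64, 1]                       # state dim -> hidden -> hidden -> V(s)
params = [(rng.normal(0, 0.1, (m, n)), np.zeros(m))
          for n, m in zip(sizes[:-1], sizes[1:])]

def v_hat(s: np.ndarray) -> float:
    """V(s; theta) = f_L ∘ ... ∘ f_1(s) with ReLU hidden activations."""
    x = s
    for i, (W, b) in enumerate(params):
        x = W @ x + b
        if i < len(params) - 1:              # linear output layer
            x = np.maximum(x, 0.0)           # ReLU
    return float(x[0])

print(v_hat(np.ones(4)))
```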

In practice, neural networks provide extraordinary expressiveness but come with optimization challenges: non-convex loss landscapes, saddle points, and — critically in RL — non-stationarity of training targets.

⚠ Neural Networks in RL: Key Challenges

Unlike supervised learning, RL training targets (e.g., Bellman targets) are themselves functions of the current network weights. This creates a moving-target problem that can destabilize learning. The three elements of the Deadly Triad amplify this.

§4 Value Function Approximation

The Projection View

Let \(\mathcal{F} = \{\hat{V}(\cdot;\theta) : \theta \in \mathbb{R}^d\}\) be our function class. For linear FA, \(\mathcal{F}\) is a \(d\)-dimensional linear subspace of the space of all functions \(\mathcal{S} \to \mathbb{R}\).

Think of the space of all possible value functions as a very high-dimensional space. Our parameterized function class \(\mathcal{F}\) is a low-dimensional manifold (or subspace, for linear FA) within this space. The true value function \(V^\pi\) lives somewhere in this ambient space, often outside \(\mathcal{F}\).

The best we can do is find the point in \(\mathcal{F}\) closest to \(V^\pi\) — this is the projection \(\Pi V^\pi\). The distance between \(V^\pi\) and \(\Pi V^\pi\) is the approximation error, which we cannot reduce by training — it's the price we pay for using a restricted function class.

Define the \(\mu\)-weighted \(L^2\) norm: \(\|V\|_\mu^2 = \sum_s \mu(s) V(s)^2\), where \(\mu\) is a state distribution (often the on-policy distribution).

The projection operator \(\Pi\) onto the linear function space \(\mathcal{F}\) is:

$$\Pi V = \Phi(\Phi^\top D \Phi)^{-1} \Phi^\top D V$$

where \(\Phi \in \mathbb{R}^{|\mathcal{S}| \times d}\) is the feature matrix and \(D = \text{diag}(\mu)\). This is a weighted least-squares projection.

The mean squared value error is:

$$\overline{VE}(\theta) = \|V^\pi - \hat{V}(\cdot;\theta)\|_\mu^2 = \sum_s \mu(s)\left[V^\pi(s) - \theta^\top\phi(s)\right]^2$$
Interactive: Projection of the True Value Function (best linear approximation in a 2D function space).

The black curve is the true value function \(V^\pi(s)\). The blue line is the best linear approximation \(\Pi V^\pi\).

The red shaded region shows the approximation error — the component of \(V^\pi\) orthogonal to the function class.

No amount of training can reduce this error. It's the bias of the function class.
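For a small, enumerable state space the projection and the irreducible error can be computed directly. The sketch below (NumPy; the feature matrix, state distribution, and target values are invented for illustration) evaluates \(\Pi V^\pi = \Phi(\Phi^\top D \Phi)^{-1}\Phi^\top D V^\pi\) and the resulting \(\overline{VE}\):

```python
import numpy as np

n_states, d = 6, 2
rng = np.random.default_rng(1)
Phi = rng.normal(size=(n_states, d))        # feature matrix (|S| x d), illustrative
mu = np.full(n_states, 1.0 / n_states)      # state distribution
D = np.diag(mu)
V_pi = rng.normal(size=n_states)            # "true" value function, illustrative

# Weighted least-squares projection onto span(Phi)
theta_star = np.linalg.solve(Phi.T @ D @ Phi, Phi.T @ D @ V_pi)
V_proj = Phi @ theta_star

# mu-weighted squared error = irreducible approximation error of the class
ve = float(mu @ (V_pi - V_proj) ** 2)
print("approximation error (VE at the projection):", ve)
```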

The Bellman Operator

The Bellman operator \(\mathcal{T}^\pi\) maps value functions to value functions and is central to all of RL; its fixed point is the policy's true value function \(V^\pi\). For policy \(\pi\):

$$(\mathcal{T}^\pi V)(s) = \sum_a \pi(a|s)\left[R(s,a) + \gamma \sum_{s'} P(s'|s,a) V(s')\right]$$

Key property: \(\mathcal{T}^\pi\) is a \(\gamma\)-contraction under the \(L^\infty\) norm, meaning repeated application converges to \(V^\pi\):

$$\|\mathcal{T}^\pi V - \mathcal{T}^\pi U\|_\infty \leq \gamma \|V - U\|_\infty$$
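The contraction can be checked numerically: repeated application of \(\mathcal{T}^\pi\) converges to \(V^\pi\) from any starting point. The sketch below uses a small random MDP whose transition matrix and rewards are invented purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(2)
n, gamma = 5, 0.9
P = rng.random((n, n)); P /= P.sum(axis=1, keepdims=True)   # P^pi: row-stochastic
r = rng.random(n)                                           # expected reward per state

def bellman_op(V: np.ndarray) -> np.ndarray:
    """(T^pi V)(s) = r(s) + gamma * sum_s' P(s'|s) V(s')."""
    return r + gamma * P @ V

V = np.zeros(n)
for _ in range(200):                                        # repeated application
    V = bellman_op(V)

V_exact = np.linalg.solve(np.eye(n) - gamma * P, r)         # closed-form fixed point
print("max error after 200 applications:", np.max(np.abs(V - V_exact)))
```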

The Deadly Triad

Sutton and Barto identified three conditions whose simultaneous presence leads to instability and divergence in RL with function approximation.

Interactive diagram: the Deadly Triad of Function Approximation, Bootstrapping (TD / DP), and Off-policy Training; all three together (Baird's example) diverge.
💀 The Deadly Triad

Any two elements together are safe. All three together can cause divergence — the approximated values grow without bound, even for simple MDPs (Baird's Counterexample). This is not a bug but a fundamental theoretical limitation.

Baird (1995) constructed a simple 7-state MDP with linear features and a specific off-policy behavior policy on which semi-gradient TD diverges.

In Baird's construction, each state's feature vector shares a component with the others: \(\phi(s_i) = 2e_i + e_8\) for \(i = 1, \ldots, 6\) and \(\phi(s_7) = e_7 + 2e_8\), so 8 parameters represent 7 states. The target policy always transitions to \(s_7\), while the behavior policy usually jumps uniformly to one of \(s_1, \ldots, s_6\). Under these conditions the matrix \(A\) defined below is not positive definite, and the expected TD update expands \(\theta_t\) along the offending eigendirections, so \(\|\theta_t\|\) grows exponentially.

Formally, the TD update is \(\theta_{t+1} = \theta_t + \alpha \delta_t \phi(s_t)\), where \(\delta_t\) is the TD error. In expectation the iteration is \(\mathbb{E}[\theta_{t+1}] = (I - \alpha A)\theta_t + \alpha b\), where \(A = \Phi^\top D(I - \gamma P^\pi)\Phi\). When \(A\) is not positive definite (which can happen when \(D\) reflects an off-policy behavior distribution), the iteration diverges.
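For any concrete choice of \(\Phi\), \(D\), \(P^\pi\), and \(\gamma\), the stability condition is easy to test: build \(A\) and inspect its eigenvalues. The helper below is a generic NumPy check (nothing in it is specific to Baird's exact construction; you supply whatever MDP and state weighting you want to probe):

```python
import numpy as np

def td_update_matrix(Phi, D, P, gamma):
    """A = Phi^T D (I - gamma P) Phi from the expected TD update."""
    n = P.shape[0]
    return Phi.T @ D @ (np.eye(n) - gamma * P) @ Phi

def stability_report(A):
    # The expected iteration theta <- (I - alpha A) theta + alpha b is stable
    # (for small enough alpha) iff every eigenvalue of A has positive real part.
    eigs = np.linalg.eigvals(A)
    print("eigenvalues of A:", np.round(eigs, 3))
    print("stable expected update:", bool(np.all(eigs.real > 0)))
```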

§5 Core Algorithms

TD(0) with Function Approximation

TD(0) is the simplest temporal difference method. After each transition \((s, a, r, s')\), we update our value estimate: the prediction for \(s\) should be consistent with the reward we received plus the discounted prediction for \(s'\). We're essentially doing stochastic gradient descent toward a bootstrap target.

The "temporal difference" is the error between our current prediction and this bootstrap target. It's like learning that if you know the value of tomorrow, you can update today's value estimate with just a one-step look-ahead, without needing the full episode return.

The TD target is \(y_t = r_t + \gamma \hat{V}(s_{t+1};\theta_t)\). The TD error is:

$$\delta_t = r_t + \gamma \hat{V}(s_{t+1};\theta_t) - \hat{V}(s_t;\theta_t)$$

True gradient descent on the squared error to this target would include a term from differentiating \(y_t\) with respect to \(\theta\); dropping that term gives the semi-gradient update:

$$\theta_{t+1} = \theta_t + \alpha \delta_t \nabla_\theta \hat{V}(s_t;\theta_t)$$

For linear FA, \(\nabla_\theta \hat{V}(s;\theta) = \phi(s)\), so: \(\theta_{t+1} = \theta_t + \alpha \delta_t \phi(s_t)\)

⚙ Algorithm: Semi-Gradient TD(0) for Policy Evaluation
Input: Policy π, feature map φ, learning rate α
Initialize: θ ← 0
for each episode:
s ← s₀ (initial state)
while s is not terminal:
a ~ π(·|s)
Take action a; observe r, s'
// Compute TD error
δ ← r + γ·θᵀφ(s') − θᵀφ(s)
// Semi-gradient update
θ ← θ + α·δ·φ(s)
s ← s'
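A runnable counterpart of the pseudocode above, applied to a small random-walk chain with one-hot features (the environment and the features are invented purely to exercise the update; with one-hot features the method reduces to tabular TD):

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, gamma, alpha = 5, 0.9, 0.1
phi = np.eye(n_states)                     # one-hot features (tabular special case)
theta = np.zeros(n_states)

def step(s):
    """Random walk: move left/right; reward 1 only when reaching the right end."""
    s_next = min(max(s + rng.choice([-1, 1]), 0), n_states - 1)
    r = 1.0 if s_next == n_states - 1 else 0.0
    done = s_next in (0, n_states - 1)
    return r, s_next, done

for _ in range(2000):                      # episodes
    s, done = n_states // 2, False
    while not done:
        r, s_next, done = step(s)
        v_next = 0.0 if done else theta @ phi[s_next]
        delta = r + gamma * v_next - theta @ phi[s]      # TD error
        theta += alpha * delta * phi[s]                  # semi-gradient update
        s = s_next

print(np.round(theta, 2))                  # estimated V(s) for each state
```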
⚠ Why "Semi-Gradient"?

TD(0) is NOT a true gradient method. The target \(r + \gamma \hat{V}(s';\theta)\) depends on \(\theta\), so the true gradient of the loss includes \(-\gamma \nabla_\theta \hat{V}(s';\theta)\). TD drops this term, treating the target as fixed. This makes the update biased but stable (under on-policy conditions).

Convergence of Semi-Gradient TD

📐 Theorem (Tsitsiklis & Van Roy, 1997)

Under on-policy sampling, linear function approximation, and step-sizes satisfying the Robbins-Monro conditions (\(\sum \alpha_t = \infty\), \(\sum \alpha_t^2 < \infty\)), TD(0) converges almost surely to \(\theta^*\) where:

$$\overline{VE}(\theta^*) \leq \frac{1}{1-\gamma} \min_\theta \overline{VE}(\theta)$$

This is a bounded approximation error result — TD finds a solution within a \(\frac{1}{1-\gamma}\) factor of the best possible in the function class.

The expected TD update direction is \(\mathbb{E}[\delta_t \phi(s_t)] = -A\theta + b\) where:

$$A = \Phi^\top D(I - \gamma P^\pi)\Phi, \quad b = \Phi^\top D r$$

where \(D = \text{diag}(\mu)\) with \(\mu\) the on-policy stationary distribution, and \(P^\pi\) the policy transition matrix.

Under on-policy sampling, \(\mu\) is the stationary distribution of \(P^\pi\), i.e. \(\mu^\top P^\pi = \mu^\top\). One can then show that \(A\) is positive definite (PD), which is the key condition for stability.

The ODE \(\dot{\theta} = -A\theta + b\) has a unique stable fixed point \(\theta^* = A^{-1}b\). By the ODE method (Benaïm, Borkar), stochastic approximation convergence follows. The bound follows from the projection error analysis.

Gradient TD Methods (GTD2, TDC)

To address off-policy divergence, Sutton et al. (2009) developed methods that converge under off-policy sampling by performing true gradient descent on the projected Bellman error \(\|\Pi \mathcal{T}^\pi V - V\|_\mu^2\), whose gradient accounts for the dependence of the TD target on \(\theta\).

The insight behind GTD is to minimize the Projected Bellman Error (PBE) rather than the Mean Squared TD Error. PBE has a true gradient that can be computed efficiently. The trick is to introduce a secondary weight vector \(w\) that helps estimate the gradient direction. This doubles the number of parameters but enables principled off-policy learning.

The objective is the Projected Bellman Error (PBE):

$$\overline{PBE}(\theta) = \|\Pi \mathcal{T}^\pi \hat{V}(\cdot;\theta) - \hat{V}(\cdot;\theta)\|_\mu^2$$

GTD2: maintains secondary weights \(w\) and updates:

$$w_{t+1} = w_t + \beta (\delta_t - \phi(s_t)^\top w_t)\phi(s_t)$$
$$\theta_{t+1} = \theta_t + \alpha \rho_t (\phi(s_t) - \gamma\phi(s_{t+1}))(\phi(s_t)^\top w_t)$$

where \(\rho_t = \pi(a_t|s_t)/b(a_t|s_t)\) is the importance sampling ratio for off-policy correction.
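A single GTD2 step, transcribed directly from the two update equations above (variable names are mine; \(\beta\) is the secondary step-size and \(\rho\) the importance ratio):

```python
import numpy as np

def gtd2_step(theta, w, phi_t, phi_next, r, rho, gamma, alpha, beta):
    """One GTD2 update for linear off-policy policy evaluation."""
    delta = r + gamma * theta @ phi_next - theta @ phi_t          # TD error
    w = w + beta * (delta - phi_t @ w) * phi_t                    # secondary weights
    theta = theta + alpha * rho * (phi_t - gamma * phi_next) * (phi_t @ w)
    return theta, w
```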

Fitted Value Iteration

An alternative: treat value estimation as a supervised regression problem. In each iteration, use the current value function to generate targets, then fit a new function.

⚙ Algorithm: Fitted Value Iteration (FVI)
Input: Dataset 𝒟 = {(s, a, r, s')}, function class ℱ
Initialize: Q̂₀ ← 0
for k = 0, 1, 2, ..., K-1:
// Compute regression targets
yᵢ ← rᵢ + γ max_a Q̂ₖ(s'ᵢ, a) for each i
// Supervised regression
Q̂ₖ₊₁ ← argmin_{f∈ℱ} Σᵢ (f(sᵢ, aᵢ) − yᵢ)²
return Q̂ₖ
📝 Key Property
FVI decouples the function approximation step from the RL structure. Any regression oracle can be used — neural networks, random forests, etc. However, off-policy data and function approximation error interact to prevent clean convergence guarantees.
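As a sketch of this regression-oracle view, the snippet below runs fitted iteration with ordinary linear least squares as the oracle; the dataset format and the feature map \(\phi(s,a)\) are placeholder assumptions, and any regressor could be swapped in:

```python
import numpy as np

def fitted_q_iteration(dataset, phi_sa, n_actions, gamma=0.99, K=50):
    """dataset: list of (s, a, r, s'); phi_sa(s, a) -> feature vector (illustrative)."""
    d = len(phi_sa(dataset[0][0], 0))
    theta = np.zeros(d)
    X = np.array([phi_sa(s, a) for s, a, _, _ in dataset])        # fixed regressors
    for _ in range(K):
        # Regression targets from the current Q estimate (Bellman backup)
        y = np.array([
            r + gamma * max(theta @ phi_sa(s_next, a2) for a2 in range(n_actions))
            for _, _, r, s_next in dataset
        ])
        theta, *_ = np.linalg.lstsq(X, y, rcond=None)             # supervised fit
    return theta
```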

§6 Q-Learning with Function Approximation

Semi-Gradient Q-Learning

Q-learning directly estimates the optimal Q-function \(Q^*\). With function approximation:

$$\delta_t = r_t + \gamma \max_{a'} \hat{Q}(s_{t+1}, a';\theta_t) - \hat{Q}(s_t, a_t;\theta_t)$$
$$\theta_{t+1} = \theta_t + \alpha \delta_t \nabla_\theta \hat{Q}(s_t, a_t;\theta_t)$$
⚠ Divergence with Off-Policy Q-Learning
Q-learning is inherently off-policy (uses max over actions, not the current behavior policy). Combined with function approximation, this hits the Deadly Triad. Tabular Q-learning converges; Q-learning + FA can diverge wildly.

Deep Q-Network (DQN)

Mnih et al. (2015) made Q-learning with neural networks work in practice via two key tricks:

🔄
Experience Replay
Store transitions in a replay buffer \(\mathcal{D}\). Sample mini-batches at random. Breaks temporal correlations, improves data efficiency, and enables multiple gradient steps per environment step.
🎯
Target Network
Maintain a separate "target network" \(\theta^-\) for computing Bellman targets, updated slowly (every \(C\) steps). Stabilizes the moving-target problem. Loss becomes:
$$\mathcal{L}(\theta) = \mathbb{E}_{(s,a,r,s') \sim \mathcal{D}}\!\left[\left(r + \gamma \max_{a'}\hat{Q}(s',a';\theta^-) - \hat{Q}(s,a;\theta)\right)^2\right]$$
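A sketch of this loss in PyTorch (assuming torch is available; the network architecture, batch layout, and hyperparameters are illustrative), showing where the frozen target network \(\theta^-\) enters:

```python
import torch
import torch.nn as nn

q_net = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 2))      # Q(s, ·; theta)
target_net = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 2))
target_net.load_state_dict(q_net.state_dict())                            # theta^- <- theta

def dqn_loss(batch, gamma=0.99):
    # s: float (B, 4), a: long (B,), r: float (B,), s_next: float (B, 4), done: float (B,)
    s, a, r, s_next, done = batch
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)                   # Q(s, a; theta)
    with torch.no_grad():                                                  # target is frozen
        target = r + gamma * (1 - done) * target_net(s_next).max(dim=1).values
    return nn.functional.mse_loss(q_sa, target)

# Every C gradient steps: target_net.load_state_dict(q_net.state_dict())
```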
Interactive: DQN Training Dynamics (observe the stabilizing effect of target network + replay).

DQN Variants

Algorithm | Key Innovation | Addresses
Double DQN | Separate networks for action selection and evaluation | Overestimation bias in max operator
Dueling DQN | Split stream for V(s) and A(s,a) = Q(s,a) − V(s) | Inefficient value estimation
Prioritized ER | Sample high-TD-error transitions more often | Sample efficiency
Rainbow | Combination of above + distributional RL + noisy nets | Multiple biases simultaneously
C51 / QR-DQN | Learn full return distribution, not just expectation | Risk-sensitivity, better gradients

§7 Policy Gradient & Actor-Critic

REINFORCE

Instead of learning a value function and deriving a policy, directly optimize the policy parameters. The objective is the expected return \(J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}[G(\tau)]\). We want to perform gradient ascent on \(J(\theta)\).

The trick: we can write the gradient of \(J\) as an expectation under the policy itself, which we can estimate by collecting rollouts. Actions that led to high returns get their probabilities increased; actions with low returns get decreased.

The Policy Gradient Theorem (Sutton et al., 2000):

$$\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\left[\nabla_\theta \log \pi_\theta(a|s) \cdot Q^{\pi_\theta}(s,a)\right]$$

REINFORCE uses the Monte Carlo return \(G_t = \sum_{k=0}^\infty \gamma^k r_{t+k}\) as an unbiased estimate of \(Q^{\pi_\theta}(s_t, a_t)\):

$$\theta \leftarrow \theta + \alpha \sum_t \nabla_\theta \log \pi_\theta(a_t|s_t) \cdot G_t$$

Adding a baseline \(b(s)\) reduces variance without introducing bias:

$$\theta \leftarrow \theta + \alpha \sum_t \nabla_\theta \log \pi_\theta(a_t|s_t) \cdot (G_t - b(s_t))$$
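A compact REINFORCE-with-baseline loss in PyTorch; the policy network, episode tensors, and the mean-return baseline are illustrative assumptions rather than anything prescribed by the text:

```python
import torch
import torch.nn as nn

policy = nn.Sequential(nn.Linear(4, 64), nn.Tanh(), nn.Linear(64, 2))    # logits f_theta(s, ·)

def reinforce_loss(states, actions, returns):
    """states: (T, 4), actions: (T,) long, returns: G_t per step; baseline = mean return."""
    logits = policy(states)
    log_probs = torch.distributions.Categorical(logits=logits).log_prob(actions)
    advantages = returns - returns.mean()          # simple baseline b(s) = average return
    # Negative sign: optimizers minimize, while we want gradient ASCENT on J(theta)
    return -(log_probs * advantages).sum()
```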

Actor-Critic Methods

Replace the Monte Carlo return with a bootstrapped critic estimate to reduce variance:

$$\theta \leftarrow \theta + \alpha_\theta \nabla_\theta \log \pi_\theta(a_t|s_t) \cdot \hat{A}(s_t, a_t)$$

where \(\hat{A}(s,a) = r + \gamma \hat{V}(s';\omega) - \hat{V}(s;\omega)\) is the advantage estimate. Two sets of parameters: \(\theta\) for the actor (policy), \(\omega\) for the critic (value function).

📐 Compatible Function Approximation

To avoid introducing bias from the critic, use a critic that is "compatible" with the actor: \(\hat{Q}(s,a;\omega) = \omega^\top \nabla_\theta \log \pi_\theta(a|s)\). Then the policy gradient theorem holds exactly with this critic. This motivates the Natural Policy Gradient and Fisher information matrix preconditioning.

PPO, TRPO, SAC

Algorithm | Key Idea | Stability Mechanism
TRPO | Trust-region constraint on policy update size | KL divergence constraint
PPO | Clipped surrogate objective, simpler than TRPO | Ratio clipping: clip(ρ, 1−ε, 1+ε)
SAC | Max entropy RL — maximize return + policy entropy | Automatic temperature tuning
TD3 | Clipped double Q-learning for continuous actions | Delayed policy updates, target smoothing

§8 Approximate Dynamic Programming

Fitted Q-Iteration (FQI)

FQI is the batch analog of Q-learning. Given a fixed dataset \(\mathcal{D}\), it iteratively applies the Bellman optimality operator and projects back onto the function class:

$$Q_{k+1} \leftarrow \text{argmin}_{f \in \mathcal{F}} \sum_{(s,a,r,s') \in \mathcal{D}} \left(f(s,a) - \left(r + \gamma \max_{a'} Q_k(s',a')\right)\right)^2$$
⚠ Accumulation of Approximation Error

Each FQI iteration introduces approximation error \(\epsilon_k\). These errors accumulate. Munos (2003, 2005) showed that the final policy's suboptimality is bounded by \(\frac{C \gamma}{(1-\gamma)^2} \max_k \epsilon_k\), where \(C\) is a concentrability coefficient that captures distributional mismatch.

Batch RL

Batch RL encompasses methods that learn from a fixed dataset without additional interaction. Key algorithms include:

  • Fitted Q-Iteration (FQI): Regression-based Bellman iteration
  • LSTD (Least-Squares TD): Direct solution to the linear TD fixed-point equation
  • LSPI (Least-Squares Policy Iteration): Policy iteration using LSTD for evaluation
  • NFQ (Neural Fitted Q-Iteration): FQI with neural network function approximator

LSTD: Direct Solution

For linear FA, the TD fixed-point satisfies \(A\theta = b\). LSTD estimates \(A\) and \(b\) from data and solves directly:

$$\hat{A} = \sum_t \phi(s_t)(\phi(s_t) - \gamma\phi(s_{t+1}))^\top, \quad \hat{b} = \sum_t r_t \phi(s_t)$$
$$\theta_{\text{LSTD}} = \hat{A}^{-1}\hat{b}$$

LSTD is data-efficient (no step-size to tune) but requires \(\mathcal{O}(d^2)\) memory and \(\mathcal{O}(d^3)\) computation to solve the system. For large feature dimensions, recursive implementations that maintain \(\hat{A}^{-1}\) incrementally (Sherman–Morrison) reduce the per-update cost to \(\mathcal{O}(d^2)\); LSTD(λ) extends the approach to eligibility traces.
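LSTD fits in a few lines of NumPy. In the sketch below the transition list and feature map are placeholders, and a small ridge term keeps \(\hat{A}\) invertible:

```python
import numpy as np

def lstd(transitions, phi, d, gamma=0.99, ridge=1e-6):
    """transitions: iterable of (s, r, s'); phi(s) -> feature vector of length d."""
    A_hat = np.zeros((d, d))
    b_hat = np.zeros(d)
    for s, r, s_next in transitions:
        f, f_next = phi(s), phi(s_next)
        A_hat += np.outer(f, f - gamma * f_next)
        b_hat += r * f
    return np.linalg.solve(A_hat + ridge * np.eye(d), b_hat)   # theta_LSTD = A^{-1} b
```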

§9 Offline RL Connection

The Offline RL Problem

Offline (or batch) RL aims to learn the best possible policy from a fixed dataset \(\mathcal{D} = \{(s, a, r, s')\}\) collected by some (possibly unknown) behavior policy \(\mu\), without further environment interaction.

🔑 Core Challenge

Online RL can query the environment to resolve uncertainty. Offline RL cannot. If the dataset lacks coverage of certain state-action regions, the learned Q-function will have uncontrolled extrapolation error in those regions — and greedy policy extraction will exploit these errors catastrophically.

Distribution Shift

The learned policy \(\pi_\theta\) may visit state-action pairs not in the training distribution \(\mu\). Under the learned policy, the true Q-values in out-of-distribution (OOD) regions are unknown, but the model may assign arbitrarily high values there.

Think of the Q-function as a model that's only been trained on some regions of the state-action space. In those regions, it's accurate. In untrained regions, it extrapolates — and neural networks can extrapolate wildly. When we greedily extract a policy, it finds the highest-Q action, which may be in a completely unconstrained OOD region.

This is the "deadly combination" for offline RL: function approximation enables extrapolation, and greedy action selection actively seeks out the highest-valued extrapolations.

Let \(d^\pi\) be the state-action distribution under policy \(\pi\), and \(d^\mu\) under the behavior policy. The concentrability coefficient is:

$$C^\pi_{\mu} = \left\|\frac{d^\pi}{d^\mu}\right\|_\infty$$

The suboptimality of an offline-learned policy can be bounded, following Munos (2003), as:

$$V^* - V^{\hat{\pi}} \leq \frac{2\gamma C^\pi_\mu}{(1-\gamma)^2}\sqrt{\frac{\epsilon_{\text{approx}} + \epsilon_{\text{stat}}}{n}}$$

When \(d^\pi \not\ll d^\mu\) (policy explores OOD regions), \(C^\pi_\mu = \infty\) and the bound is vacuous.

Offline RL Solutions

Approach | Algorithm | Mechanism
Policy Constraint | BCQ, BEAR | Constrain \(\pi\) to stay close to \(\mu\)
Value Pessimism | CQL, IQL | Penalize OOD Q-values during training
Model-Based | MOPO, MOReL | Use model uncertainty as reward penalty
Implicit Constraint | TD3+BC, DT | Behavioral cloning regularization

Conservative Q-Learning (CQL)

CQL (Kumar et al., 2020) adds a regularization term to the standard Q-learning objective that pushes down Q-values for OOD actions while maintaining accuracy for in-distribution ones:

$$\mathcal{L}_{\text{CQL}}(\theta) = \mathcal{L}_{\text{TD}}(\theta) + \alpha \left[\mathbb{E}_{s \sim \mathcal{D}, a \sim \text{uniform}} \hat{Q}(s,a;\theta) - \mathbb{E}_{(s,a) \sim \mathcal{D}} \hat{Q}(s,a;\theta)\right]$$
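The regularizer can be bolted onto any TD loss. The sketch below implements the uniform-action variant written above in PyTorch; the batch layout is an assumption, and the TD loss itself (e.g., the DQN loss sketched earlier) is added separately:

```python
import torch

def cql_penalty(q_net, s, a):
    """CQL regularizer (uniform-action variant): E_uniform[Q] - E_data[Q]."""
    q_all = q_net(s)                                          # Q(s, ·; theta), shape (B, |A|)
    q_data = q_all.gather(1, a.unsqueeze(1)).squeeze(1)       # Q at dataset actions
    q_uniform = q_all.mean(dim=1)                             # uniform expectation over actions
    return (q_uniform - q_data).mean()

# Total loss: standard TD loss plus the pessimism term
# loss = td_loss + alpha * cql_penalty(q_net, s, a)
```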

§10 Theory

Approximation vs Estimation Error

The total error in value function learning decomposes into two components:

$$\underbrace{\|V^\pi - \hat{V}_{\hat{\theta}}\|}_{\text{Total Error}} \leq \underbrace{\|V^\pi - \Pi V^\pi\|}_{\text{Approximation Error (Bias)}} + \underbrace{\|\Pi V^\pi - \hat{V}_{\hat{\theta}}\|}_{\text{Estimation Error (Variance)}}$$
📉
Approximation Error
Distance from the true value function to the best function in the class. Reduced by richer function classes. Cannot be reduced by more data — it's irreducible bias.
📊
Estimation Error
Distance from best in-class to learned function. Decreases with more data (∝ 1/√n). Increases with function class complexity (VC dimension, covering number).

Bellman Error Minimization

The Bellman residual or Bellman error at a function \(f\) is:

$$\text{BE}(f) = \|f - \mathcal{T}^\pi f\|_\mu^2 = \sum_s \mu(s)\left[f(s) - (\mathcal{T}^\pi f)(s)\right]^2$$

Minimizing BE directly is appealing but faces the double sampling problem: the gradient of BE requires two independent samples from the same state, which is typically unavailable in sequential data.

The gradient of BE w.r.t. \(\theta\) is:

$$\nabla_\theta \text{BE}(\theta) = 2\mathbb{E}_\mu\left[(\hat{V}(s;\theta) - \mathcal{T}^\pi \hat{V}(s;\theta))\nabla_\theta(\hat{V}(s;\theta) - \mathcal{T}^\pi \hat{V}(s;\theta))\right]$$

Expanding: \(\nabla_\theta \mathcal{T}^\pi \hat{V}(s;\theta) = \gamma \mathbb{E}_{s'}[\nabla_\theta \hat{V}(s';\theta)]\). Estimating this requires a separate \(s'\) sample from the same \(s\), independent of the one used to estimate \(\mathcal{T}^\pi \hat{V}\). In online RL, we observe only one transition from each state, causing a biased estimator. GTD methods sidestep this via the PBE objective.

Stability: Positive Definiteness of A

For linear TD, the key stability condition is that the matrix \(A = \Phi^\top D(I - \gamma P^\pi)\Phi\) be positive definite. This holds when:

  • On-policy: \(D\) is the stationary distribution of \(\pi\), which satisfies a balance condition making \(A\) PD.
  • Off-policy: No guarantee. \(A\) can have negative eigenvalues, causing divergence (Baird's example).

Concentrability Coefficients

Munos (2003, 2005) formalized how the mismatch between training distribution \(\mu\) and target distribution affects error propagation in ADP. The concentrability coefficient:

$$c(m) = \sup_\pi \left\|\frac{(P^\pi)^m d_0}{d^\mu}\right\|_\infty$$

measures how quickly the policy occupancy can grow relative to the data distribution. When data lacks coverage, \(c(m)\) is large, and approximation errors compound over policy iterations.

📐 Munos's ADP Error Bound (2003)
$$\|Q^* - Q^{\hat{\pi}_K}\|^2_{d_0} \leq \frac{\gamma^2 C}{(1-\gamma)^4} \max_{1 \leq k \leq K} \epsilon_k$$

where \(C = \sum_{m=1}^\infty m \gamma^m c(m)\) accumulates all concentrability over the horizon, \(\epsilon_k\) is the regression error per iteration, and \(\hat{\pi}_K\) is the greedy policy w.r.t. \(Q^K\).

Neural Tangent Kernel (NTK) Perspective

For overparameterized neural networks, training dynamics in the lazy training regime are governed by the Neural Tangent Kernel \(K(s, s') = \nabla_\theta f_\theta(s)^\top \nabla_\theta f_\theta(s')\). This kernel remains approximately constant during training, effectively reducing to linear FA in feature space defined by the network Jacobian.

§11 Modern Advances

Deep RL Scaling

Modern deep RL systems (AlphaZero, MuZero, Gato, RT-2) show that scale matters enormously. Larger networks, more diverse training environments, and longer training lead to emergent capabilities not seen in smaller systems. Key insights:

  • Transformer-based policies (Decision Transformer, Gato) treat RL as sequence modeling
  • World models (DreamerV3) enable efficient learning from raw pixels via imagined rollouts
  • Foundation model pre-training for RL initialization

Representation Learning in RL

The quality of the state representation \(\phi(s)\) fundamentally determines the difficulty of function approximation. Modern approaches to learning representations:

🔮
Auxiliary Tasks
Learn features that predict future states, rewards, or observations (e.g., UNREAL and related auxiliary-task methods). Better representations for the main task emerge as a byproduct.
🌀
Bisimulation Metrics
Learn representations in which behaviorally similar states (same future reward distribution) are metrically close. DBC, MICo.
📡
Self-Supervised
Contrastive and predictive coding methods (CURL, SPR, ATC) pre-train encoders without reward signals.
Linear Successor Rep.
Successor features decompose Q(s,a) = w⊤ψ(s,a), enabling fast transfer across reward functions.

Offline RL Progress

Offline RL has seen dramatic progress since 2020, enabling learning from large pre-collected datasets:

  • IQL (Implicit Q-Learning): Avoids evaluating OOD actions entirely via expectile regression
  • Decision Transformer: Reframes offline RL as conditional sequence modeling
  • Offline-to-Online: Fine-tuning offline-learned policies with online interaction
  • Dataset scaling: D4RL, RoboMimic, Bridge Data — larger offline datasets enable better policies

Implicit Regularization in Neural FA

Recent work shows that gradient descent on neural networks in RL has implicit regularization effects. Overparameterized networks may converge to minimum-norm solutions, and the inductive biases of specific architectures (CNNs for spatial tasks, Transformers for sequential tasks) can be understood through the lens of function approximation.

§12 Historical Timeline

1957
Bellman Equations & Dynamic Programming
Richard Bellman establishes the mathematical foundation of sequential decision making. Value functions and the Bellman optimality principle are born.
1988
TD Learning (Sutton)
Temporal difference learning introduced. TD(λ) unifies Monte Carlo and DP methods via eligibility traces. First formal treatment of learning with FA.
1989
Q-Learning (Watkins)
Model-free off-policy control algorithm introduced. Tabular convergence proven. Foundation for modern deep RL.
1992
TD-Gammon (Tesauro)
First impressive demonstration of neural FA in RL — learns backgammon at grandmaster level using TD(λ) with a neural network.
1995
Baird's Counterexample
Shows that off-policy TD with linear FA can diverge. Formalizes the Deadly Triad. Motivates decades of research into stable algorithms.
1997
Convergence of Linear TD (Tsitsiklis & Van Roy)
Proves that on-policy TD(0) with linear FA converges almost surely, and characterizes the solution as within 1/(1-γ) of optimal in class.
2003
Error Propagation in ADP (Munos)
Provides first rigorous error bounds for approximate policy iteration with concentrability coefficients. Framework for understanding offline RL.
2009
Gradient TD (Sutton et al.)
GTD2 and TDC — true gradient methods for off-policy learning. Convergent under off-policy sampling via two-timescale analysis.
2013–2015
Deep Q-Network (DQN, Mnih et al.)
Experience replay + target networks enable stable Q-learning with CNNs. Human-level performance on 49 Atari games. Ignites the deep RL revolution.
2016
AlphaGo / A3C
Deep policy gradient methods at scale. Asynchronous training enables large parallelism. Policy gradient enters the mainstream.
2018–2020
Offline RL Surge
BCQ, BEAR, CQL — offline RL with function approximation. D4RL benchmark establishes standardized evaluation.
2021–2025
Scaling & Foundation Models
Decision Transformer, Gato, RT-2, DreamerV3. RL meets large-scale pretraining. Function approximation with billions of parameters.

Summary

🗺 The Big Picture

Function approximation is the bridge between tabular RL theory and real-world applicability. The core tension: richer function classes reduce approximation error but increase estimation error and destabilize training. The Deadly Triad (FA + bootstrapping + off-policy) is the central challenge. Modern deep RL navigates this through architectural choices (target networks, replay), algorithmic design (CQL, PPO), and implicit regularization. Theory increasingly lags practice, especially for neural FA, making this an active and exciting research frontier.

1
Start with Linear FA
Strong theoretical guarantees. Understand the geometry before going nonlinear.
2
Know the Deadly Triad
Any practical system must address FA + bootstrapping + off-policy. No free lunch.
3
Representation Matters
The feature map φ(s) determines approximation error. Invest in representation.
4
Offline RL is Hard
Distribution shift + OOD extrapolation require explicit pessimism or constraints.

References

  • [1]
    Reinforcement Learning: An Introduction (2nd ed.)
    Sutton, R.S. & Barto, A.G. (2018). MIT Press. — The canonical textbook. Chapters 9–13 cover function approximation thoroughly.
  • [2]
    An Analysis of Temporal-Difference Learning with Function Approximation
    Tsitsiklis, J.N. & Van Roy, B. (1997). IEEE Transactions on Automatic Control. — Proves convergence of linear TD under on-policy sampling.
  • [3]
    Error Bounds for Approximate Policy Iteration
    Munos, R. (2003). ICML. — Introduces concentrability coefficients and error propagation bounds for ADP.
  • [4]
    Fast Gradient-Descent Methods for Temporal-Difference Learning
    Sutton, R.S., Maei, H.R., et al. (2009). ICML. — GTD2 and TDC: true gradient methods for off-policy learning.
  • [5]
    Human-Level Control through Deep Reinforcement Learning
    Mnih, V., et al. (2015). Nature. — DQN: experience replay + target networks enable deep Q-learning.
  • [6]
    Reinforcement Learning: Theory and Algorithms
    Agarwal, A., Jiang, N., Kakade, S., & Sun, W. (2022). — Modern theoretical treatment with sample complexity guarantees.
  • [7]
    Conservative Q-Learning for Offline Reinforcement Learning
    Kumar, A., et al. (2020). NeurIPS. — CQL: pessimistic Q-learning for offline settings.
  • [8]
    Offline Reinforcement Learning: Tutorial, Review, and Perspectives
    Levine, S., Kumar, A., Tucker, G., & Fu, J. (2020). arXiv. — Comprehensive survey of offline RL methods and theory.
  • [9]
    Policy Gradient Methods for Reinforcement Learning with Function Approximation
    Sutton, R.S., McAllester, D., Singh, S., & Mansour, Y. (2000). NeurIPS. — The Policy Gradient Theorem.
  • [10]
Mastering Atari, Go, Chess and Shogi by Planning with a Learned Model (MuZero)
    Schrittwieser, J., et al. (2020). Nature. — Deep model-based RL at scale with learned world models.